Abstract

We consider a class of pattern matching problems where a normalising transfor- 
mation is applied at every alignment. Normalised pattern matching plays a key role 
in fields as diverse as image processing and musical information processing where 
application specific transformations are often applied to the input. By considering 
the class of polynomial transformations of the input, we provide fast algorithms and 
the first lower bounds for both new and old problems. 

Given a pattern of length $m$ and a longer text of length $n$ where both are assumed
to contain integer values only, we first show $O(n \log m)$ time algorithms for pattern
matching under linear transformations even when wildcard symbols can occur in the
input. We then show how to extend the technique to polynomial transformations of
arbitrary degree. Next we consider the problem of finding the minimum Hamming
distance under polynomial transformation. We show that, for any $\varepsilon > 0$, there
cannot exist an $O(nm^{1-\varepsilon})$ time algorithm for additive and linear transformations
conditional on the hardness of the classic 3SUM problem. Finally, we consider a
version of the Hamming distance problem under additive transformations with a
bound $k$ on the maximum distance that need be reported. We give a deterministic
$O(nk \log k)$ time solution which we then improve by careful use of randomisation to
$O(n\sqrt{k \log k}\,\log n)$ time for sufficiently small $k$. Our randomised solution outputs
the correct answer at every position with high probability.

1 Introduction 

We consider pattern matching problems where the task is to find the distance between 
a pattern and every substring of the text of suitable length. In the class of problems 
we consider, the values in the pattern can first be transformed so as to minimise this
distance. Further, the selection of which transformation to apply, which is possibly
distinct for each alignment, forms part of the problem that is to be solved. This class of
problems generalises the well known problem of exact matching with wildcards [CH02,
CC07] as well as the set of problems known previously as transposition invariant
matching [MNU05], both of which come from the pattern matching literature. However, as we
will see, it is considerably broader than both, with applications in both image processing
and musical information retrieval.

By way of a first motivation for our work, consider a fundamental problem in image 
processing which is to measure the similarity between a small image segment or template 
and regions of comparable size within a larger scene. It is well known that the cross- 
correlation between the two can be computed efficiently at every position in the larger 
image using the fast Fourier transform (FFT). In practice, images may differ in a number 
of ways including being rotated, scaled or affected by noise. We consider here the case 
where the intensity or brightness of an image occurrence is unknown and where parts 
of either image contain don't care or wildcard pixels, i.e. pixels that are considered to 
be irrelevant as far as image similarity is concerned. As an example, a rectangular 
image segment may contain a facial image and the objective is to identify the face in a 
larger scene. However, some faces in the larger scene are in shadow and others in light. 
Furthermore, background pixels around the faces may be considered to be irrelevant for 
facial recognition and these should not affect the search algorithm. 

In order to overcome the first difficulty of varying intensity within an image, a stan- 
dard approach is to compute the normalised distance when comparing a template to part 
of a larger image. Thus both template and image are transformed or rescaled in order 
to make any matches found more meaningful and to allow comparisons between matches 
at different positions. Within the image processing literature the accepted method of 
normalisation is to scale the mean and variance of the template and image segments. We 
take a slightly different although related approach to normalisation which will allow
us to show a number of natural generalisations.

We start by defining measures of distance between a pattern $P$ and text $T$, where $P$
is a string of length $m$ and $T$ is a string of length $n \ge m$, both over the integers. The
squared $L_2$ or Euclidean distance between the pattern and the text at position $i$ is

\[ \sum_{j=0}^{m-1} \bigl(P[j] - T[i+j]\bigr)^2 . \]

In this case, for each $i \in \{0,\dots,n-m\}$, the pattern can be normalised, or fitted as
closely as possible to the text, by transforming the input to minimise the distance.

In the case of degree one polynomial transformations, the normalised $L_2$ distance
between the pattern and the text at position $i$ can now be written as

\[ \min_{\alpha,\beta} \sum_{j=0}^{m-1} \bigl(\alpha + \beta P[j] - T[i+j]\bigr)^2 , \]

where the minimisation is over rational values of $\alpha$ and $\beta$. The minimisation is per
alignment of $P$ and $T$, hence the values of $\alpha$ and $\beta$ may (and probably will) differ
between the positions $i$.

We also consider the case when the input alphabet is augmented with the special
wildcard symbol, denoted ★. A position where either the pattern or text has a wildcard
will not contribute to the distance. That is, the minimisation is carried out using the
sum of the remaining terms. Details are given in the problem definitions in the next
section.

1.1 Problems and our results 

The words shift and scale are used to refer to additive and multiplicative transformations
of the pattern, respectively. The input to all our problems is a text $T$ of length $n$ and a
pattern $P$ of length $m$, and the output is a problem specific distance $d(i)$ between $P$ and
$T$ at every position $i \in \{0,\dots,n-m\}$ of the text. To avoid overloading variable names,
we give the distance $d(i)$ a unique name for each problem.

Problem 1 (Shift-$L_2$). Normalised $L_2$ distance under shifts. Wildcards are allowed.

\[ d_2^{+}(i) \stackrel{\text{def}}{=} \min_{\alpha} \sum_{j=0}^{m-1} \bigl(\alpha + P[j] - T[i+j]\bigr)^2 . \]

When either $P[j] = \star$ or $T[i+j] = \star$, the contribution of the pair to the sum is
taken to be zero. The minimisation is carried out using the sum of the remaining terms.

Next we define the normalised $L_2$ distance under shifts and scaling, corresponding to
a degree one polynomial transformation of the values of the pattern.

Problem 2 (ShiftScale-$L_2$). Normalised $L_2$ distance under shifts and scaling.
Wildcards are allowed.

\[ d_2^{*}(i) \stackrel{\text{def}}{=} \min_{\alpha,\beta} \sum_{j=0}^{m-1} \bigl(\alpha + \beta P[j] - T[i+j]\bigr)^2 . \]

When either $P[j] = \star$ or $T[i+j] = \star$, the contribution of the pair to the sum $d_2^{*}(i)$ is
taken to be zero. The minimisation is carried out using the sum of the remaining terms.

We show that both Shift-$L_2$ and ShiftScale-$L_2$ can be solved in $O(n \log m)$ time
by the use of FFTs of integer vectors. Our results are stated in Theorems 10 and 11. We
assume the RAM model of computation throughout in order to be consistent with
previous work on matching with wildcards. Further, our techniques also provide $O(n \log m)$
time solutions (Theorems 12 and 13) to the problems of exact shift matching with
wildcards (Shift-Exact*) and exact shift-scale matching with wildcards
(ShiftScale-Exact*), formally defined as follows.

Problem 3 (Shift-Exact*). Normalised exact matching under shifts. Wildcards are
allowed. The output at position $i$ is

\[ \begin{cases} 1, & \exists\, \alpha \text{ s.t. } \alpha + P[j] = T[i+j] \text{ for all } j \in \{0,\dots,m-1\}; \\ 0, & \text{otherwise.} \end{cases} \]

Every position $j$ where either $P[j] = \star$ or $T[i+j] = \star$ is ignored.

Problem 4 (ShiftScale-Exact*). Normalised exact matching under shifts and scaling.
Wildcards are allowed. The problem is defined similarly to Shift-Exact*, only
that we check whether there exist $\alpha$ and $\beta$ such that $\alpha + \beta P[j] = T[i+j]$ for all positions $j$
(except positions where $P[j] = \star$ or $T[i+j] = \star$).

We will also discuss extensions to pattern transformations under polynomials of higher
degree in Section 2. In terms of normalised $L_2$ distance we give the following definition.

Problem 5 (Poly-$r$-$L_2$). Normalised $L_2$ distance under degree-$r$ polynomial
transformation. Wildcards are allowed. Let $f(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \dots + \alpha_r x^r$ be a polynomial
of degree $r$ with $r \ge 1$.

\[ d_2^{r}(i) \stackrel{\text{def}}{=} \min_{\alpha_0,\dots,\alpha_r} \sum_{j=0}^{m-1} \bigl(f(P[j]) - T[i+j]\bigr)^2 . \]

When either $P[j] = \star$ or $T[i+j] = \star$, the contribution of the pair to the sum $d_2^{r}(i)$ is
taken to be zero. The minimisation is carried out using the sum of the remaining terms.

Note that the problem ShiftScale-$L_2$ is the same problem as Poly-$r$-$L_2$ with degree
$r = 1$. We will show that Poly-$r$-$L_2$ can be solved in $O(rn \log m + r^{\omega} n)$ time, where $\omega$
is the exponent for matrix multiplication (e.g., $\omega \approx 2.38$ when using the
Coppersmith-Winograd algorithm).

The second main topic of our work is on normalised pattern matching problems under 
the Hamming distance. The Hamming distance is perhaps the most commonly considered 
measure of distance between strings in the field of pattern matching. We therefore define 
related normalised versions of our pattern matching problems in a similar way to before. 

Problem 6 (Shift-Ham). Normalised Hamming distance under shifts. Wildcards are
not allowed.

\[ d_H^{+}(i) \stackrel{\text{def}}{=} \min_{\alpha} \bigl|\{\, j \mid \alpha + P[j] \ne T[i+j] \,\}\bigr| . \]

Problem 7 (ShiftScale-Ham). Normalised Hamming distance under shifts and scaling.
Wildcards are not allowed.

\[ d_H^{*}(i) \stackrel{\text{def}}{=} \min_{\alpha,\beta} \bigl|\{\, j \mid \alpha + \beta P[j] \ne T[i+j] \,\}\bigr| . \]

Previously it has been shown that Shift-Ham, sometimes also referred to as
transposition invariant matching, can be solved in $O(nm \log m)$ time [MNU05]. It has been
tempting to believe that it might be possible to improve this time complexity,
particularly as there exist algorithms for standard non-normalised pattern matching under the
Hamming distance which take $O(n\sqrt{m \log m})$ time [Abr87, Kos87]. We show by
reductions from the well known 3SUM problem that for both shift and shift-scale matching
under the Hamming distance there cannot exist an $O(nm^{1-\varepsilon})$ time algorithm for any
$\varepsilon > 0$ (Theorems 18 and 20).
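
As a point of reference for the bounds above, the definition of Shift-Ham can be evaluated directly with about $nm$ dictionary updates. The sketch below is only an illustration of the definition, not the algorithm of [MNU05]; the function name `shift_ham` is an assumption made for this example. It relies on the observation that a shift $\alpha$ makes position $j$ match exactly when $\alpha = T[i+j] - P[j]$, so the best shift is the most frequent difference.

```python
from collections import Counter


def shift_ham(text, pattern):
    """Naive evaluation of min over shifts alpha of the Hamming distance.

    Hypothetical helper for illustration; inputs are lists of integers
    without wildcards, as in Problem 6.
    """
    n, m = len(text), len(pattern)
    out = []
    for i in range(n - m + 1):
        # Position j matches under shift alpha iff alpha == T[i+j] - P[j],
        # so the optimal shift is the most common difference.
        votes = Counter(text[i + j] - pattern[j] for j in range(m))
        out.append(m - max(votes.values()))
    return out
```

For example, with text `[1, 2, 3, 5, 8]` and pattern `[0, 1, 3]`, the middle alignment has all differences equal to 2 and therefore distance 0.
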
To circumvent this new conjectured lower bound, we consider as our last problem
a shift version of the $k$-mismatch problem. In the $k$-mismatch problem, the Hamming
distance is to be reported at every alignment as long as it is at most $k$. If it is greater
than $k$ then the algorithm is only required to report that the Hamming distance is large.
We define the problem as follows.

Problem 8 (Shift-$k$-Mismatch). Normalised $k$-mismatch under shifts. Wildcards are
not allowed.

\[ d_k^{+}(i) \stackrel{\text{def}}{=} \min\bigl(d_H^{+}(i),\, k+1\bigr) . \]

We first give a simple deterministic $O(nk \log k)$ time solution (Theorem 24). We then
consider a decision version of the problem where we output only the locations $i$ where
$d_H^{+}(i) \le k$ but not the Hamming distance at those locations. The decision version is
defined as follows.

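Problem 9 (Shift-$k$-Decision). Normalised $k$-mismatch decision problem under shifts.
Wildcards are not allowed. The output at position $i$ is

\[ \begin{cases} 0, & \text{if } d_H^{+}(i) \le k; \\ 1, & \text{otherwise.} \end{cases} \]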

Using randomisation we show how to solve this problem in $O(cn\sqrt{k \log k}\,\log n)$ time
for the case that k < ym/6 (Theorem [33|) . Here c is a constant that can be chosen 
arbitrarily to fine tune the error probability. Namely, our algorithm outputs the correct 
answer at every alignment with probability at least 1 — We therefore succeed 

in breaking our newly introduced running time barrier provided by the reduction from 
3SUM for a limited range of values of k. 



1.2 Related work 

Combinatorial pattern matching has concerned itself mainly with strings of symbolic
characters where the distance between individual characters is specified by some
convention. For the $k$-mismatch problem, an $O(nk)$ time algorithm was given in 1986 that uses
constant time lowest common ancestor queries on the suffix tree of the pattern and text
in a technique that has subsequently come to be known as 'kangaroo hopping' [LV86].
Almost 20 years afterwards, the asymptotic running time was finally improved in [ALP04]
to $O(n\sqrt{k \log k})$ time by a method based on filtering, the suffix tree (with kangaroo
hopping) and FFTs. In 2002, a deterministic $O(n \log m)$ time solution for exact matching
with wildcards was given by Cole and Hariharan and further simplified in [CC07].

In the same paper by Cole and Hariharan, an $O(n \log(\max(m, N)))$ time algorithm for
the exact shift matching problem we consider in Section 2 was presented. Here $N$ is the
largest value in the input. The approach we take to provide a simpler solution for this
problem is similar in spirit to that of [CC07].



There has also been some work in recent years on fast algorithms for distance
calculation and approximate matching between numerical strings. A number of different
metrics have been considered, with, for example, $O(n\sqrt{m \log m})$ time solutions found for
the $L_1$ distance [Ata01, CCI05, ALPU05] and less-than matching [AF95] problems and
an $O(\delta n \log m)$ time algorithm for the $\delta$-bounded version of the $L_\infty$ norm first discussed
in |Cin4l | and then improved in |CCin.4 |lP05|. 

The most closely related work to ours comes under the heading of transposition
invariant matching [LU00]. The original motivation for this problem was within musical
information retrieval where musical search is to be performed invariant of pitch level
transposition. The transposition invariant distance between two equal-length strings
$A$ and $B$ is defined to be $\min_{\alpha} d(A + \alpha, B)$, where $A + \alpha$ is the string obtained from $A$
by adding $\alpha$ to every value and the distance $d$ between strings can be variously defined.
Algorithms for transposition invariant Hamming distance, longest common subsequence
(LCS) and Levenshtein (edit) distance amongst others were given in [MNU05] whose time
complexities are close to the known upper bounds without transposition. We show, in
Section 3, lower bounds for the special case of transposition invariant Hamming distance,
which we named Shift-Ham. Normalised pattern matching is also of central interest in
the image processing literature where normalisation is typically performed by scaling the
mean and standard deviation of the template and each suitably sized image segment to
be 0 and 1, respectively. An asymptotically fast method for performing normalised
cross-correlation for template matching, also using FFTs, was given in [Lew95]. The methods
we give in Section 2 have some broad similarity to their approach only in the use of
FFTs to provide fast solutions. Due to the differences in the definition of normalisation
between our work and theirs, the solutions we give are otherwise quite distinct.

As a general class of problems, pattern matching under polynomial transformation
is to the best of our knowledge new. However, if we allow the degree of the polynomial
transformation to increase to $m$, then determining for which alignment the normalised
distance equals zero is equivalent to the known problem of function matching. Function
matching has a deterministic $O(n|\Sigma_P| \log m)$ time solution, where $|\Sigma_P|$ is the size of the
pattern alphabet, and a faster randomised algorithm which runs in $O(n \log n)$ time and
has failure probability $1/n$ [AALP06].



1.3 Basic notation 

For a string $X$ of length $\ell$, we write $X[i]$ to denote the $i$th character of $X$ such that
$X = X[0]\,X[1]\,X[2] \cdots X[\ell-1]$ (the first index is always zero). The $s$-length substring
of $X$ starting at position $i$ is denoted $X[i \dots i+s-1]$. For two strings $X$ and $Y$, the
notation $X \| Y$ is used to denote the string formed by concatenating $X$ and $Y$ in that order.
All strings in this paper are over the integer alphabet. Therefore, $X[j]Y[j]$ denotes the
product of the numerical characters $X[j]$ and $Y[j]$. If strings $X$ and $Y$ are of equal
length, we use the notation $X \cdot Y$ for the string with characters $(X \cdot Y)[i] = X[i]Y[i]$.
This element-wise arithmetic is used similarly for addition, subtraction, division and
power. For example, the $i$th symbol of $X^2/Y$ is $X[i]^2/Y[i]$. For a real value $k$, the scalar
multiplication $kX$ is the string $(kX)[i] = kX[i]$.

The notation $\mathrm{Ham}(X, Y)$ will be used to denote the Hamming distance between
equal-length strings $X$ and $Y$:

\[ \mathrm{Ham}(X, Y) \stackrel{\text{def}}{=} \bigl|\{\, i \mid X[i] \ne Y[i] \,\}\bigr| . \]

Throughout this paper we use T to denote the text and P for the pattern. We use n 
to denote the length of T and m for the length of P. 

Our algorithms in Section 2 make extensive use of FFTs. An important property of
the FFT is that the cross-correlation, defined as

\[ (T \otimes P)[i] \stackrel{\text{def}}{=} \sum_{j=0}^{m-1} P[j]\,T[i+j] , \]

can be calculated accurately and efficiently for all $i \in \{0,\dots,n-m\}$ in $O(n \log m)$
time (see e.g. [CLR90, Chapter 32]). The time complexity is reduced from $O(n \log n)$
to $O(n \log m)$ using a standard splitting trick which partitions the text into $2m$ length
substrings which overlap each other by $m$ characters. When it is clear from the context
we use $\sum_j$ as an abbreviation for $\sum_{j=0}^{m-1}$.
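
The cross-correlation and the splitting trick described above can be sketched as follows. This is a minimal illustration using NumPy's real FFT, not code from the paper; the names `cross_correlate` and `cross_correlate_blocked` are assumptions made for this example. The blocked variant uses blocks of length $2m-1$ (overlapping by $m-1$), a slight variation on the $2m$-length blocks mentioned in the text that yields exactly $m$ alignments per block. Results are rounded to integers, which is only safe while the values are small enough for double precision.

```python
import numpy as np


def cross_correlate(text, pattern):
    """Return c with c[i] = sum_j pattern[j] * text[i + j], i = 0..n-m."""
    t = np.asarray(text, dtype=np.int64)
    p = np.asarray(pattern, dtype=np.int64)
    n, m = len(t), len(p)
    size = n + m - 1  # long enough to avoid circular wrap-around
    # Correlation equals convolution with the reversed pattern.
    prod = np.fft.rfft(t, size) * np.fft.rfft(p[::-1], size)
    full = np.fft.irfft(prod, size)
    # Entries m-1 .. n-1 of the convolution are the n-m+1 full alignments.
    return np.rint(full[m - 1:n]).astype(np.int64)


def cross_correlate_blocked(text, pattern):
    """Same output, computed on overlapping blocks of length at most 2m-1."""
    t = np.asarray(text, dtype=np.int64)
    n, m = len(t), len(pattern)
    out = np.empty(n - m + 1, dtype=np.int64)
    for start in range(0, n - m + 1, m):
        block = t[start:start + 2 * m - 1]  # alignments start .. start+m-1
        vals = cross_correlate(block, pattern)
        out[start:start + len(vals)] = vals
    return out
```

Each block costs $O(m \log m)$ and there are about $n/m$ blocks, which is where the $O(n \log m)$ total comes from.
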

We use ★ for the single character wildcard symbol. Under arithmetics on strings,
as defined above, we may think of a wildcard as having the value zero. This value
is, however, inconsequential for our purposes, as all expressions in this paper have the
property that whenever a wildcard symbol is involved in some arithmetics, it is multiplied
by a zero.

We write $[n]$ to denote the set of integers $\{0,\dots,n-1\}$. We also say that $g(n) \in \tilde{\Omega}(h(n))$
if and only if $g(n) \in \Omega(h(n)/\log^{c} n)$ for some constant $c$, i.e. $g(n) \in \Omega(h(n))$ up to log
factors.



1.4 Organisation 

The remainder of the paper is organised as follows. In Section 2 we discuss normalised
pattern distance under $L_2$ distance (Shift-$L_2$ and ShiftScale-$L_2$) and the decision
variants (Shift-Exact* and ShiftScale-Exact*). We also show how to extend the
methods to transformations of higher degree polynomials (Poly-$r$-$L_2$). Then in
Section 3 we give running time lower bounds for Shift-Ham and ShiftScale-Ham by
reduction from the 3SUM problem. In Section 4 we introduce our new deterministic
and randomised algorithms for Shift-$k$-Mismatch and Shift-$k$-Decision. Finally, we
conclude in Section 5 and set out some open problems.



2 Normalised L2 distance 

We give $O(n \log m)$ time solutions for shift and shift-scale versions of the normalised $L_2$
distance problem with wildcards. We further show that this enables us to solve the exact
shift matching and exact shift-scale matching problems in the same time complexity for
inputs containing wildcard symbols. Lastly we show how to extend our solutions to
normalisation under polynomials of arbitrary degree.

Algorithm 1 Solution to Shift-$L_2$.

1. Construct $P'$ from $P$ such that $P'[j] = 0$ if $P[j] = \star$, and $P'[j] = 1$ otherwise.
   Construct $T'$ from $T$ similarly.

2. Compute the following six cross-correlations:

   $C_1 = (T^2 \cdot T') \otimes P'$,          $C_3 = T' \otimes (P^2 \cdot P')$,   $C_5 = T' \otimes (P \cdot P')$,
   $C_2 = (T \cdot T') \otimes (P \cdot P')$,   $C_4 = (T \cdot T') \otimes P'$,     $C_6 = T' \otimes P'$.

3. Return $A = C_1 - 2C_2 + C_3 - \bigl((C_4 - C_5)^2 / C_6\bigr)$. We have $d_2^{+}(i) = A[i]$. For positions $i$
   where $C_6[i] = 0$ we have $d_2^{+}(i) = 0$.



2.1 Normalised L2 distance under shifts 

In order to handle wildcards, we define two new strings $P'$ and $T'$ obtained from $P$ and
$T$, respectively, such that $P'[j] = 0$ if $P[j] = \star$ and $P'[j] = 1$ otherwise. Similarly,
$T'[i] = 0$ if $T[i] = \star$, and $T'[i] = 1$ otherwise. We can now express the shift normalised
$L_2$ distance at position $i$ as

\[ d_2^{+}(i) = \min_{\alpha} \sum_{j=0}^{m-1} \Bigl( \bigl(\alpha + P[j] - T[i+j]\bigr)^2 \cdot P'[j] \cdot T'[i+j] \Bigr) . \]

Algorithm 1 shows how to compute $d_2^{+}(i)$ for all positions $i$. Correctness and running
time are given in the following theorem.

Theorem 10. The shift version of the normalised $L_2$ distance with wildcards problem
(Shift-$L_2$) can be solved in $O(n \log m)$ time.

Proof. Consider Algorithm 1. We first analyse the running time. Step 1 requires only
single passes over the input. Similarly, $(P^2 \cdot P')$, $(P \cdot P')$, $(T \cdot T')$ and $(T^2 \cdot T')$ can all
be calculated in linear time once $T'$ and $P'$ are known. Using the FFT, the six
cross-correlations in Step 2 can be calculated in $O(n \log m)$ time. The final vector of Step 3 is
obtained in linear time. Thus, $O(n \log m)$ is the overall time complexity of the algorithm.
To show correctness we consider the minimum value of

\[ A[i] = \sum_{j=0}^{m-1} \Bigl( \bigl(\alpha + P[j] - T[i+j]\bigr)^2 \cdot P'[j] \cdot T'[i+j] \Bigr) . \tag{1} \]

This can be obtained by differentiating with respect to $\alpha$ and obtaining the minimising
value. Solving

\[ \frac{\partial A[i]}{\partial \alpha} = 2 \sum_{j=0}^{m-1} \Bigl( \bigl(\alpha + P[j] - T[i+j]\bigr) \cdot P'[j] \cdot T'[i+j] \Bigr) = 0 \]
Algorithm 2 Solution to ShiftScale-$L_2$.

1. Construct $P'$ from $P$ such that $P'[j] = 0$ if $P[j] = \star$, and $P'[j] = 1$ otherwise.
   Construct $T'$ from $T$ similarly.

2. Compute the following six cross-correlations:

   $C_1 = (T^2 \cdot T') \otimes P'$,          $C_3 = T' \otimes (P^2 \cdot P')$,   $C_5 = T' \otimes (P \cdot P')$,
   $C_2 = (T \cdot T') \otimes (P \cdot P')$,   $C_4 = (T \cdot T') \otimes P'$,     $C_6 = T' \otimes P'$.

3. Compute

   $B_1 = C_3 \cdot C_4 - C_2 \cdot C_5$,   $B_2 = C_3 \cdot C_6 - C_5^2$,   $B_3 = C_2 - \dfrac{C_4 \cdot C_5}{C_6}$,   $B_4 = C_3 - \dfrac{C_5^2}{C_6}$,

   and compute $\hat{\alpha} = B_1/B_2$ and $\hat{\beta} = B_3/B_4$. At positions $i$ where $C_6[i] = 0$, set
   $\hat{\alpha}[i] = \hat{\beta}[i] = 0$. At positions $i$ where $B_2[i] = 0$ and $C_6[i] \ne 0$, set $\hat{\alpha}[i] = C_4[i]/C_6[i]$ and
   $\hat{\beta}[i] = 0$.

4. Return $B = (\hat{\alpha}^2 \cdot C_6) + 2(\hat{\alpha} \cdot \hat{\beta} \cdot C_5) - 2(\hat{\alpha} \cdot C_4) + (\hat{\beta}^2 \cdot C_3) - 2(\hat{\beta} \cdot C_2) + C_1$.
   We have $d_2^{*}(i) = B[i]$.

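gives us the value

\[ \hat{\alpha}[i] = \frac{\sum_{j=0}^{m-1} \bigl( (T[i+j] - P[j]) \cdot P'[j] \cdot T'[i+j] \bigr)}{\sum_{j=0}^{m-1} P'[j] \cdot T'[i+j]} = \frac{\bigl((T \cdot T') \otimes P'\bigr)[i] - \bigl(T' \otimes (P \cdot P')\bigr)[i]}{(T' \otimes P')[i]} , \]

where $\hat{\alpha}[i]$ is the minimising value at position $i$. Substituting $\alpha = \hat{\alpha}[i]$ into Equation (1),
expanding and collecting terms, we obtain the final answer as

\[ A = C_1 - 2C_2 + C_3 - \frac{(C_4 - C_5)^2}{C_6} , \]

where $C_1,\dots,C_6$ are the correlations defined in Algorithm 1.

Lastly we observe that when $C_6[i] = (T' \otimes P')[i] = 0$ there is a wildcard at every
position in the alignment of $P$ and $T$. Here the shift normalised $L_2$ distance is defined
to be 0. □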
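
The three steps of Algorithm 1 can be sketched in Python as below. This is an illustrative sketch rather than the authors' implementation: the encoding of the wildcard as `None`, the helper `_xcorr` and the function name `shift_l2` are assumptions made for this example, and each correlation is computed with a single FFT of length $n+m-1$ (so without the block splitting that gives $O(n \log m)$).

```python
import numpy as np

WILDCARD = None  # assumed encoding of the wildcard symbol


def _xcorr(t, p):
    """(t (x) p)[i] = sum_j p[j] * t[i + j], via one FFT-based convolution."""
    n, m = len(t), len(p)
    size = n + m - 1
    prod = np.fft.rfft(t, size) * np.fft.rfft(p[::-1], size)
    return np.rint(np.fft.irfft(prod, size)[m - 1:n]).astype(np.int64)


def shift_l2(text, pattern):
    """Shift-normalised squared L2 distance at every alignment."""
    # Step 1: values with wildcards set to 0, plus the 0/1 indicator strings.
    T = np.array([0 if x is WILDCARD else x for x in text], dtype=np.int64)
    P = np.array([0 if x is WILDCARD else x for x in pattern], dtype=np.int64)
    Tm = np.array([0 if x is WILDCARD else 1 for x in text], dtype=np.int64)
    Pm = np.array([0 if x is WILDCARD else 1 for x in pattern], dtype=np.int64)
    # Step 2: the six cross-correlations C1..C6.
    C1 = _xcorr(T * T * Tm, Pm)
    C2 = _xcorr(T * Tm, P * Pm)
    C3 = _xcorr(Tm, P * P * Pm)
    C4 = _xcorr(T * Tm, Pm)
    C5 = _xcorr(Tm, P * Pm)
    C6 = _xcorr(Tm, Pm)
    # Step 3: A = C1 - 2 C2 + C3 - (C4 - C5)^2 / C6, and 0 where C6 = 0.
    safe = np.where(C6 == 0, 1, C6)
    A = C1 - 2 * C2 + C3 - (C4 - C5) ** 2 / safe
    return np.where(C6 == 0, 0.0, A)
```

For instance, `shift_l2([5, 6, 7], [1, 2, 3])` gives a single distance of 0, since the shift $\alpha = 4$ maps the pattern onto the text.
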



2.2 Normalised L2 distance under shift-scale 

Similarly to the shift version of normalised L2 distance in the previous section, we can 
now solve the shift-scale version. The solution is slightly more involved but the running 
time remains the same. Algorithm [2] sets out the main steps to achieve this and the 
result is summarised in the following theorem. 

Theorem 11. The shift-scale version of the normalised $L_2$ distance with wildcards
problem (ShiftScale-$L_2$) can be solved in $O(n \log m)$ time.

Proof. Consider Algorithm 2. Notice that the same six correlations as in Algorithm 1
have to be calculated. Computing the additional strings in Step 3 requires linear time, as does
producing the output in Step 4. Hence the overall running time is $O(n \log m)$.

Similarly to Equation (1) we can express the shift-scale version of the normalised $L_2$
distance at position $i$ as

\[ B[i] = \sum_{j=0}^{m-1} \Bigl( \bigl(\alpha + \beta P[j] - T[i+j]\bigr)^2 \cdot P'[j] \cdot T'[i+j] \Bigr) . \tag{2} \]

By minimising this expression with respect to both $\alpha$ and $\beta$ we get a system of two
simultaneous linear equations

\[ \frac{\partial B[i]}{\partial \alpha} = 2 \sum_{j=0}^{m-1} \Bigl( \bigl(\alpha + \beta P[j] - T[i+j]\bigr) \cdot P'[j] \cdot T'[i+j] \Bigr) = 0 , \]

\[ \frac{\partial B[i]}{\partial \beta} = 2 \sum_{j=0}^{m-1} \Bigl( \bigl(\alpha + \beta P[j] - T[i+j]\bigr) \cdot P[j] \cdot P'[j] \cdot T'[i+j] \Bigr) = 0 . \]

By solving this system and using the definitions of $B_1,\dots,B_4$ in Algorithm 2 we get the
minimising values

\[ \hat{\alpha} = \frac{B_1}{B_2} \quad\text{and}\quad \hat{\beta} = \frac{B_3}{B_4} . \]

For some positions $i$, the solution to the system might not be unique. This happens at
alignments $i$ for which every position $i+j$ has a wildcard, hence $C_6[i] = 0$. Here we avoid
illegal division by zero by simply setting both $\hat{\alpha}[i]$ and $\hat{\beta}[i]$ to zero (any value would do).
A non-unique solution also occurs at alignments $i$ where all $P[j]$ are identical over every
non-wildcard position $i+j$. This is characterised by $B_2[i] = 0$. To see this, observe that
$C_5[i]^2 \le C_3[i]\,C_6[i]$ by the Cauchy-Schwarz inequality, with equality exactly when $P[j]$ is
constant over the non-wildcard positions. Here we set (arbitrarily) $\hat{\beta}[i] = 0$ and
therefore obtain the minimising value $\hat{\alpha}[i] = C_4[i]/C_6[i]$.

At Step 4, $\hat{\alpha}$ and $\hat{\beta}$ contain the minimising values for $\alpha$ and $\beta$ at every position. We
substitute these into Equation (2) and expand. This gives us the expression for $B$. □
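
A sketch of Algorithm 2 along the same lines is given below. As before, the `None` wildcard encoding and the names `_xcorr` and `shift_scale_l2` are assumptions made for this illustration. The sketch computes $\hat{\beta}$ as $(C_6 C_2 - C_4 C_5)/B_2$, which equals $B_3/B_4$ after multiplying numerator and denominator by $C_6$, so that a single degeneracy test on $B_2$ suffices.

```python
import numpy as np

WILDCARD = None  # assumed encoding of the wildcard symbol


def _xcorr(t, p):
    """(t (x) p)[i] = sum_j p[j] * t[i + j], via one FFT-based convolution."""
    n, m = len(t), len(p)
    size = n + m - 1
    prod = np.fft.rfft(t, size) * np.fft.rfft(p[::-1], size)
    return np.rint(np.fft.irfft(prod, size)[m - 1:n]).astype(np.int64)


def shift_scale_l2(text, pattern):
    """Return (B, alpha, beta): distance and minimisers at every alignment."""
    T = np.array([0 if x is WILDCARD else x for x in text], dtype=np.int64)
    P = np.array([0 if x is WILDCARD else x for x in pattern], dtype=np.int64)
    Tm = np.array([0 if x is WILDCARD else 1 for x in text], dtype=np.int64)
    Pm = np.array([0 if x is WILDCARD else 1 for x in pattern], dtype=np.int64)
    C1 = _xcorr(T * T * Tm, Pm)
    C2 = _xcorr(T * Tm, P * Pm)
    C3 = _xcorr(Tm, P * P * Pm)
    C4 = _xcorr(T * Tm, Pm)
    C5 = _xcorr(Tm, P * Pm)
    C6 = _xcorr(Tm, Pm)
    B2 = C3 * C6 - C5 ** 2
    alpha = np.zeros(len(C6))
    beta = np.zeros(len(C6))
    ok = B2 != 0  # unique minimiser exists
    alpha[ok] = (C3 * C4 - C2 * C5)[ok] / B2[ok]   # B1 / B2
    beta[ok] = (C6 * C2 - C4 * C5)[ok] / B2[ok]    # equals B3 / B4
    flat = (B2 == 0) & (C6 != 0)  # pattern constant on non-wildcard positions
    alpha[flat] = C4[flat] / C6[flat]              # beta stays 0 here
    # Step 4: expand Equation (2) at the minimising values.
    B = (alpha ** 2 * C6 + 2 * alpha * beta * C5 - 2 * alpha * C4
         + beta ** 2 * C3 - 2 * beta * C2 + C1)
    return B, alpha, beta
```

For text `[3, 5, 7, 9]` and pattern `[1, 2, 3]` both alignments are exact fits, with $(\hat{\alpha}, \hat{\beta}) = (1, 2)$ and $(3, 2)$ respectively.
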

2.3 Exact shift and shift-scale matching with wildcards 

For the exact shift matching problem with wildcards, Shift-Exact*, a match is said
to occur at location $i$ if, for some shift $\alpha$ and for every position $j$ in the pattern, either
$\alpha + P[j] = T[i+j]$ or at least one of $P[j]$ and $T[i+j]$ is the wildcard symbol. Cole
and Hariharan [CH02] introduced a new coding for this problem that maps the string
elements into 0 for wildcards and complex numbers of modulus 1 otherwise. The FFT
is then used to find the (complex) cross-correlation between these coded strings, and
finally a shift match is declared at location $i$ if the $i$th element of the modulus of the
cross-correlation is equal to $(T' \otimes P')[i]$.

Our Algorithm 1 provides a straightforward alternative method for shift matching
with wildcards. It has the advantage of only using simple integer codings. Since
Algorithm 1 finds the minimum $L_2$ distance at location $i$, over all possible shifts, it is only
necessary to test whether this distance is zero. The running time for the test is then
$O(n \log m)$ since it is determined by the running time of Algorithm 1.

Theorem 12. The problem of exact shift matching with wildcards (Shift-Exact*) can
be solved in $O(n \log m)$ time.

The exact shift-scale matching problem with wildcards, ShiftScale-Exact*, can
be solved similarly by applying Algorithm 2.

Theorem 13. The problem of exact shift-scale matching with wildcards
(ShiftScale-Exact*) can be solved in $O(n \log m)$ time.
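
The zero test on the output of Algorithm 1 can be carried out entirely in integers by multiplying $A$ through by $C_6$: a shift match occurs at $i$ exactly when $C_6(C_1 - 2C_2 + C_3) - (C_4 - C_5)^2 = 0$. The sketch below illustrates this variant, which is our own rearrangement rather than a step stated in the text. It assumes the wildcard is encoded as `None`, uses the hypothetical name `exact_shift_match`, and for brevity uses NumPy's direct (quadratic time) `np.correlate` as a stand-in for the FFT-based correlations.

```python
import numpy as np

WILDCARD = None  # assumed encoding of the wildcard symbol


def exact_shift_match(text, pattern):
    """Return a 0/1 list: 1 where some shift matches all non-wildcard pairs."""
    T = np.array([0 if x is WILDCARD else x for x in text], dtype=np.int64)
    P = np.array([0 if x is WILDCARD else x for x in pattern], dtype=np.int64)
    Tm = np.array([0 if x is WILDCARD else 1 for x in text], dtype=np.int64)
    Pm = np.array([0 if x is WILDCARD else 1 for x in pattern], dtype=np.int64)

    def corr(a, b):
        # Direct correlation; an FFT would replace this for O(n log m) time.
        return np.correlate(a, b, mode="valid")

    C1 = corr(T * T * Tm, Pm)
    C2 = corr(T * Tm, P * Pm)
    C3 = corr(Tm, P * P * Pm)
    C4 = corr(T * Tm, Pm)
    C5 = corr(Tm, P * Pm)
    C6 = corr(Tm, Pm)
    # A[i] = 0 iff this integer expression vanishes; it is also 0 when
    # C6[i] = 0, matching the convention that such alignments have distance 0.
    residual = C6 * (C1 - 2 * C2 + C3) - (C4 - C5) ** 2
    return (residual == 0).astype(int).tolist()
```

With text `[5, 6, ★, 9, 1]` and pattern `[1, ★, 3]` (wildcards written as `None` in code) the result is `[1, 0, 1]`: the middle alignment would need shifts 5 and 6 simultaneously.
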



2.4 Normalised L2 distance under higher degree transformations 

We can now consider the problem of computing the normalised $L_2$ distance under general
polynomial transformations. The problem, which we termed Poly-$r$-$L_2$, was defined in
Problem 5. Recall that we let

\[ f(x) = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \dots + \alpha_r x^r \]

be a polynomial of degree $r \ge 1$. Similarly to the shift and shift-scale versions of the
normalised $L_2$ distance we consider the minimum value of

\[ D[i] = \sum_{j=0}^{m-1} \Bigl( \bigl(f(P[j]) - T[i+j]\bigr)^2 \cdot P'[j] \cdot T'[i+j] \Bigr) . \tag{3} \]

By differentiating with respect to each $\alpha_k$ in turn, giving

\[ \frac{\partial D[i]}{\partial \alpha_k} = 2 \sum_{j=0}^{m-1} \Bigl( \bigl(f(P[j]) - T[i+j]\bigr) \cdot P[j]^k \cdot P'[j] \cdot T'[i+j] \Bigr) = 0 , \]

we obtain a system of $r+1$ linear equations in $r+1$ unknowns for each alignment $i$ of the
pattern and text. We need to solve these equations and then substitute the minimising
$\alpha_k$ values back into Equation (3) as we did in the proof of Theorem 11. This procedure
is captured by the following theorem.

Theorem 14. The normalised $L_2$ distance problem with wildcards under polynomial
transformations of degree $r$ (Poly-$r$-$L_2$) can be solved in $O(rn \log m + r^{2.38} n)$ time.

Proof. To compute the coefficients for the first linear equation for $\alpha_0$ we need to perform
$O(r)$ cross-correlations. However, for each subsequent equation for $\alpha_1, \dots, \alpha_r$ we only need
to perform a constant number of new cross-correlations. Therefore the total number of
cross-correlations is $O(r)$ to give the coefficients of all the equations, taking $O(rn \log m)$
time overall. The time to solve the systems of $O(r)$ equations in $O(r)$ unknowns is $O(r^{\omega})$
per alignment $i$, where $\omega$ is the exponent for matrix multiplication. This gives $O(nr^{\omega})$
time or $O(nr^{2.38})$ using the algorithm of Coppersmith and Winograd [CW90].
Once the equations have been solved, and the minimising values of $\alpha_k$ calculated, they
are then substituted into the expression for $D$ in Equation (3). To calculate the final
values $D[i]$ we require $O(r)$ cross-correlations to be computed as well as $O(r^2)$ products
of vectors of length $m$. The overall time complexity is therefore $O(rn \log m + r^{2.38} n +
r^2 m)$. □
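
The procedure in the proof can be sketched as follows. This is an illustration under several assumptions, not the authors' implementation: wildcards are encoded as `None`, the function name `poly_l2` is hypothetical, correlations use NumPy's direct `np.correlate` in place of FFTs, and each $(r+1) \times (r+1)$ system is solved with a general least-squares routine (cubic in $r$) rather than fast matrix multiplication. The sketch uses the $O(r)$ correlations $M_k = T' \otimes (P^k \cdot P')$ for $k = 0,\dots,2r$ and $R_k = (T \cdot T') \otimes (P^k \cdot P')$ for $k = 0,\dots,r$, so that the normal equations at alignment $i$ read $\sum_l \alpha_l M_{k+l}[i] = R_k[i]$.

```python
import numpy as np

WILDCARD = None  # assumed encoding of the wildcard symbol


def poly_l2(text, pattern, r):
    """Normalised squared L2 distance under degree-r polynomial transforms."""
    T = np.array([0 if x is WILDCARD else x for x in text], dtype=float)
    P = np.array([0 if x is WILDCARD else x for x in pattern], dtype=float)
    Tm = np.array([0 if x is WILDCARD else 1 for x in text], dtype=float)
    Pm = np.array([0 if x is WILDCARD else 1 for x in pattern], dtype=float)

    def corr(a, b):
        # Direct correlation; an FFT would replace this for O(n log m) time.
        return np.correlate(a, b, mode="valid")

    # Moment correlations: O(r) of them in total.
    M = [corr(Tm, P ** k * Pm) for k in range(2 * r + 1)]
    R = [corr(T * Tm, P ** k * Pm) for k in range(r + 1)]
    C1 = corr(T * T * Tm, Pm)

    out = np.zeros(len(C1))
    for i in range(len(C1)):
        # Normal equations G a = b for the coefficients a = (alpha_0..alpha_r).
        G = np.array([[M[k + l][i] for l in range(r + 1)]
                      for k in range(r + 1)])
        b = np.array([R[k][i] for k in range(r + 1)])
        a = np.linalg.lstsq(G, b, rcond=None)[0]  # copes with singular G
        # Expand Equation (3) at the minimising coefficients.
        out[i] = C1[i] - 2 * (a @ b) + a @ G @ a
    return out
```

Because powers up to $P[j]^{2r}$ appear, this floating point sketch is only suitable for small values and small $r$.
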

This method is of particular relevance for low degree polynomials, or at least
polynomials whose degree is less than the number of distinct values in the pattern.
However, if the degree $r$ is greater than the number of distinct values in the pattern, then
there exists a suitable polynomial $f$ for any mapping we should choose. This gives us a
straightforward $O(nm)$ time solution by considering each position of the pattern in the
text independently and ignoring any values aligned with wildcards in either the pattern
or text. For each such position we need only set $f(P[j])$ to be the mean of the values in
the text that align with a value equal to $P[j]$ in the pattern.



3 Lower bounds for Hamming distance 

In this section we will show that no $O(nm^{1-\varepsilon})$ time algorithm can exist for either
Shift-Ham or ShiftScale-Ham conditional on the hardness of the classic 3SUM problem. One
formulation of the 3SUM problem is given below.

Definition 15 (3SUM). Given a set of $s$ positive integers, determine whether there are
three elements $a, b, c$ in the set such that $a + b = c$.

The 3SUM problem can be solved in $O(s^2)$ time and it is a long standing conjecture
that this is essentially the best possible. The problem has been extensively discussed
in the literature, where Gajentaan and Overmars [GO95] were the first to introduce the
concept of 3SUM-hardness (see definition below) to show that a wide range of problems
in computational geometry are at least as hard as the 3SUM problem. One example is
the GeomBase problem, defined below, which we will use in one of our reductions in
this section. See [Kin04] for a survey of problems from computational geometry whose
hardness relies on that of 3SUM.
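
The quadratic upper bound mentioned above can be realised by sorting the set and, for each candidate $c$, scanning with two pointers for a pair summing to $c$. The sketch below is one standard way to do this and is given only as an illustration; the name `three_sum` is an assumption, and so is the choice to let $a$ and $b$ be the same element, which Definition 15 leaves open.

```python
def three_sum(values):
    """Decide whether a + b = c for elements a, b, c of a set of positive ints.

    Assumption of this sketch: a and b may be the same element.
    """
    xs = sorted(set(values))
    for c in xs:
        lo, hi = 0, len(xs) - 1
        # Two-pointer scan: O(s) per target, O(s^2) overall.
        while lo <= hi:
            total = xs[lo] + xs[hi]
            if total == c:
                return True
            if total < c:
                lo += 1
            else:
                hi -= 1
    return False
```
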

Definition 16 (GeomBase). Given a set of s points with integer coordinates on three 
horizontal lines y = 0, y = 1 and y = 2, determine whether there exists a non-horizontal 
line containing three of the points. 

Although an $\Omega(s^2)$ lower bound for 3SUM is only conjectured, it has been shown that
under certain restricted models of computation, $\Omega(s^2)$ is a true lower bound (see
[Eri99a, Eri99b]). Under models that allow more direct manipulation of numbers instead
of just real arithmetic, such as the word-RAM model, an almost $\log^2 s$ factor improvement
to the standard $O(s^2)$ solution has been shown to be possible under the Las Vegas model
of randomisation (see [BDP05]). Nevertheless, a 3SUM-hardness result for a problem is a
strong indication that finding an $O(s^{2-\varepsilon})$ time solution is going to be a challenging task.

Before we show that Shift-Ham and ShiftScale-Ham are both 3SUM-hard, we
provide a brief but formal discussion about reductions and define 3SUM-hardness.

3.1 3SUM reductions

Following the definitions of [GO95] where 3SUM-hardness was first introduced, we say
that a problem $A$ is $g(s)$-solvable using a problem $B$ if and only if every instance of $A$ of
size $s$ can be solved using a constant number of instances of $B$ of at most $O(s)$ size and
$O(g(s))$ additional time. We denote this as $A \ll_{g(s)} B$. When $g(s)$ is sufficiently small,
lower bounds for $A$ carry over to $B$. A problem $B$ is 3SUM-hard if 3SUM $\ll_{g(s)} B$ and
$g(s) = o(s^{2-\varepsilon})$ for some constant $\varepsilon > 0$. In the definition of 3SUM-hardness of [GO95]
the requirement was that $g(s) = o(s^2)$, however, to scale with more powerful models of
computation, we require that $g(s) = o(s^{2-\varepsilon})$. If $A \ll_{g(s)} B$ and $B \ll_{g(s)} A$ then we say
that $A$ and $B$ are $g(s)$-equivalent.

In the following section we will show that 3SUM $\ll_{s \log s}$ Shift-Ham where the
instance size of Shift-Ham is a text of length $n = 5s$ and a pattern of length $m = 3s$.

In the literature there are a variety of definitions of the 3SUM problem. They differ
only slightly in their formulations and are all equivalent. One common definition, used
as the "base problem" in [GO95], is formulated as follows. Given a set of $s$ integers,
determine whether there are three elements $a, b, c$ in the set such that $a + b + c = 0$.
Without too much work, one can show that this definition is $O(s)$-equivalent with
Definition 15 of 3SUM above (small modifications of the proof of Theorem 3.1 in [GO95]
can be used to prove this). Further, it was shown in [GO95] that GeomBase is
$O(s)$-equivalent to 3SUM.



3.2 3SUM-hardness of Shift-Ham

In this section we show that Shift-Ham is 3SUM-hard.

Lemma 17. 3SUM $\ll_{s \log s}$ Shift-Ham, where the instance size of Shift-Ham is a text
of length $5s$ and a pattern of length $3s$.

Proof. Let the set $S$ be an instance of 3SUM of size $s = |S|$. First we sort all elements
of $S$ so that $S = \{x_1, \ldots, x_s\}$ where $x_1 < x_2 < \cdots < x_s$. Let $y_1 = 2x_s + 1$ and for
$i \in \{2, \ldots, 2s\}$, let $y_i = y_{i-1} + 1$. Thus, $x_s < y_1 < \cdots < y_{2s}$. We define the following
$s$-length strings over the alphabet $\{x_1, \ldots, x_s\} \cup \{y_1, \ldots, y_{2s}\} \cup \{0\}$.

$S_0 = 0\,0 \cdots 0$ ($s$ zeros)
$S_1 = x_1\, x_2 \cdots x_s$
$S_2 = y_1\, y_2 \cdots y_s$
$S_3 = y_{s+1}\, y_{s+2} \cdots y_{2s}$
$S_4 = x_s\, x_{s-1} \cdots x_1$

We now construct an instance of Shift-Ham specified by

$$T = S_0 \,\|\, S_1 \,\|\, S_2 \,\|\, S_1 \,\|\, S_3 \quad\text{and}\quad P = S_4 \,\|\, S_0 \,\|\, S_0 .$$

The text $T$ has length $n = 5s$ and the pattern $P$ has length $m = 3s$. First we show
that if there are elements $a, b, c \in S$ such that $a + b = c$ then there is a position $i$ such
that the shift-normalised Hamming distance between $P$ and $T[i \ldots i+m-1]$ is at most
$m - 2$. We will then show that if no such three elements exist then the shift-normalised
Hamming distance between $P$ and every $m$-length substring of $T$ is strictly greater than
$m - 2$.

As an illustrative example, suppose that $S$ contains seven elements and suppose that
$x_4 + x_3 = x_6$. Consider the alignment of $P$ and $T$ where $x_4$ in $P$ is aligned with $x_6$ in $T$
(the leading block $S_0$ of $T$ is not shown):

T:  x1   x2   x3   x4   x5  [x6]  x7   y1   y2   y3   y4   y5   y6   y7   x1   x2  [x3]  x4   x5   x6   x7   y8   y9   y10  y11  y12  y13  y14
P:            x7   x6   x5  [x4]  x3   x2   x1   0    0    0    0    0    0    0   [0]   0    0    0    0    0    0

We observe that shifting the pattern by $x_3$ will induce two matches, marked with the
square brackets above. Thus, the shift-normalised Hamming distance is at most $m - 2$ (in fact,
it is exactly $m - 2$). It should be easy to see how this generalises to any size of $S$ and
any three elements $a, b, c \in S$ such that $a + b = c$. Namely, the alignment in which $a$ is
aligned with $c$ has Hamming distance at most $m - 2$ since there must also be a match
at the position where a zero of $P$ is aligned with $b$. The construction of $P$ and $T$ ensures that there
is always an alignment that captures these matches.

Now suppose there are no elements $a, b, c \in S$ such that $a + b = c$. Consider a fixed
alignment of $P$ and $T$. We will show that there can be at most one match under any shift.
By construction of $P$ and $T$, the zeros in $P$ are all aligned with distinct symbols in $T$.
Hence for any shift, at most one of these zeros can be involved in a match. The non-zero
symbols of $P$ (i.e., the $s$-length prefix of $P$) appear in strictly decreasing order and are
aligned with an $s$-length substring of $T$ whose elements appear in non-decreasing order.
Therefore, under any shift, at most one of the non-zero symbols in $P$ can be involved in
a match. It remains to show that there is no shift such that both a zero and a non-zero
symbol in $P$ are simultaneously involved in a match. First, we observe that if there is
a match between $0$ in $P$ and some $y_j$ in $T$ then there can be no other match as every
non-zero symbol in $P$ is aligned with a value that is less than $y_j$. Suppose therefore that
there is a match between $0$ in $P$ and some $x_j$ in $T$ (i.e., the shift is $x_j$). We need to
consider three possible cases: there is also a match that involves some $x_k$ in $P$ aligned
with either (i) $0$ in $T$, (ii) some $y_\ell$ in $T$ or (iii) some $x_\ell$ in $T$. In case (i) the shift must be
negative, hence is not compatible with the shift $x_j$. In case (ii) we can see that the shift
must be greater than $x_s$ (the largest element in the set $S$), hence is not compatible with
the shift $x_j$. In case (iii) we have that $x_k + x_j = x_\ell$, which contradicts the assumption
that there are no elements $a, b, c \in S$ such that $a + b = c$. Thus, the shift-normalised
Hamming distance is at least $m - 1$ for any alignment of $P$ and $T$.

Finally, we observe that the most time consuming part of the reduction is the sorting
of $S$, which takes $O(s \log s)$ time. This concludes the proof. □
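The construction in the proof of Lemma 17 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are ours, the input is assumed to be a set of distinct positive integers (the case analysis above relies on positive shifts), the sketch does not try to exclude $a = b$, and the Shift-Ham instance is solved by brute force rather than by any algorithm from this paper.

```python
from collections import Counter

def shift_ham(pattern, window):
    """Shift-normalised Hamming distance of two equal-length integer lists.

    The best shift is the most common value of window[j] - pattern[j].
    """
    diffs = Counter(w - p for p, w in zip(pattern, window))
    return len(pattern) - max(diffs.values())

def reduce_3sum_to_shift_ham(S):
    """Build (T, P) from a set S of distinct positive integers (assumption)."""
    xs = sorted(S)                                        # x_1 < ... < x_s
    s = len(xs)
    ys = [2 * xs[-1] + 1 + i for i in range(2 * s)]       # y_1 < ... < y_2s
    zeros = [0] * s                                       # S_0
    text = zeros + xs + ys[:s] + xs + ys[s:]              # S_0 S_1 S_2 S_1 S_3
    pattern = xs[::-1] + zeros + zeros                    # S_4 S_0 S_0
    return text, pattern

def has_sum_triple(S):
    """True iff some alignment has shift-normalised distance at most m - 2."""
    text, pattern = reduce_3sum_to_shift_ham(S)
    m = len(pattern)
    return any(shift_ham(pattern, text[i:i + m]) <= m - 2
               for i in range(len(text) - m + 1))
```

For instance, `has_sum_triple({1, 2, 3})` finds the alignment in which $1$ in $P$ sits under $3$ in $T$, while `has_sum_triple({1, 3, 7})` finds no alignment with two simultaneous matches.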

Theorem 18. Shift-Ham has no $O(nm^{1-\epsilon})$ time algorithm, for any $\epsilon > 0$, conditional
on the hardness of the 3SUM problem.

Proof. Given a 3SUM instance of size $s$, by Lemma 17 we construct a Shift-Ham instance
of size $n = 5s$ and $m = 3s$ in $O(s \log s)$ time. If Shift-Ham has an $O(nm^{1-\epsilon})$ time
algorithm then 3SUM can be solved in $O(s^{2-\epsilon})$ time. □



Notice that Shift-Ham has an $O(nm \log m)$ time solution [MNU05]. See Section 4.1
for details.

3.3 3SUM-hardness of ShiftScale-Ham



In this section we show that ShiftScale-Ham is 3SUM-hard. 

Lemma 19. 3SUM $\ll_{s}$ ShiftScale-Ham, where the instance size of ShiftScale-Ham is a
pattern and text of length $s$ each.

Proof. We reduce from the GeomBase problem, which is $O(s)$-equivalent to 3SUM. Before
we describe the reduction we adopt a formulation of the GeomBase problem that
differs slightly in notation. Instead of insisting on the points being on the horizontal lines
$y = 0$, $y = 1$ and $y = 2$, we assume that the points are on the vertical lines $x = 0$, $x = 1$
and $x = 2$ and we want to determine whether there is a (non-vertical) line containing
three points. Under this formulation, let $S$ be an instance of GeomBase that contains
the integer points $(x_1, y_1), (x_2, y_2), \ldots, (x_s, y_s)$, where every $x_j \in \{0, 1, 2\}$.

We construct an instance of ShiftScale-Ham that is specified by the text
$T = y_1 y_2 \cdots y_s$ and the pattern $P = x_1 x_2 \cdots x_s$, both of length $s$. It should now be clear
that ShiftScale-Ham returns the shift-and-scale normalised Hamming distance $s - 3$
(for the only alignment of $P$ and $T$) if and only if there are two values $\alpha$ and $\beta$ such
that $\beta x_j + \alpha = y_j$ for three distinct positions $j$, which is equivalent to fitting a line
through three points. Note that we minimise over $\alpha$ and $\beta$ in the rationals, and any line
going through three points is indeed specified by rational values of $\alpha$ and $\beta$. Since the
reduction takes linear time, we have proved the lemma. □
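To make the correspondence with line fitting concrete, the following brute-force sketch evaluates the shift-and-scale normalised Hamming distance for a single alignment using exact rational arithmetic. The function name and the candidate-enumeration strategy are our own assumptions for illustration; the code takes cubic time and is not an algorithm proposed in this paper.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

def shift_scale_ham(P, T):
    """min over rational (alpha, beta) of #{j : beta*P[j] + alpha != T[j]}.

    Assumes len(P) == len(T) >= 1. Every optimal line either passes through
    two points with distinct x-values or only hits copies of a single point.
    """
    points = list(zip(P, T))
    # A line through one point matches every copy of that point.
    best = max(Counter(points).values())
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue                      # no non-vertical line through both
        beta = Fraction(y2 - y1, x2 - x1)
        alpha = y1 - beta * x1
        on_line = sum(beta * x + alpha == y for x, y in points)
        best = max(best, on_line)
    return len(points) - best
```

For the GeomBase-style instance with points $(0,0), (1,2), (2,4), (0,5)$, that is `P = [0, 1, 2, 0]` and `T = [0, 2, 4, 5]`, the line $y = 2x$ passes through three points and the function returns $s - 3 = 1$.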

Theorem 20. ShiftScale-Ham has no $O(nm^{1-\epsilon})$ time algorithm, for any $\epsilon > 0$, conditional
on the hardness of the 3SUM problem.

Proof. Given a 3SUM instance of size $s$, by Lemma 19 we construct a ShiftScale-Ham
instance of size $n = m = s$ in $O(s)$ time. If ShiftScale-Ham has an $O(nm^{1-\epsilon})$ time
algorithm then 3SUM can be solved in $O(s^{2-\epsilon})$ time. □

4 Normalised k-mismatch under shifts

In this section we consider two versions of the normalised k-mismatch problem under
shifts, defined as Problems [8] and [9] in the introduction. Both problems are parameterised
by an integer $k$. In the first problem, Shift-k-Mismatch, the output is the shift-normalised
Hamming distance between $P$ and $T$ at every position for which the distance
is $k$ or less. Where the distance is larger than $k$, only $k + 1$ is output. Recall from the
introduction that the shift-normalised Hamming distance between $P$ and $T[i \ldots i+m-1]$
is denoted $d^+(i)$ and defined by

$$d^+(i) = \min_{\alpha} \left|\{\, j \mid \alpha + P[j] \neq T[i+j] \,\}\right| .$$

In Section 4.2 we give a deterministic algorithm that solves Shift-k-Mismatch in
$O(nk \log k)$ time.

In Section 4.3 we consider the second version of shift-normalised k-mismatch,
Shift-k-Decision, which unlike the previous problem only indicates with yes or no
whether the shift-normalised Hamming distance is $k$ or less. We give a randomised
solution to this decision problem with the improved running time $O(cn\sqrt{k \log k}\,\log n)$.
The parameter $c$ is a constant that can be chosen arbitrarily to fine-tune the error
probability. Namely, the probability that our algorithm outputs the correct answer at
every alignment is at least $1 - 1/n^c$. The errors are one-sided, such that the algorithm will
never miss reporting an alignment for which the shift-normalised Hamming distance is
indeed $k$ or less. Our algorithm requires that $k < \sqrt{m/6}$, hence it is suited to situations
where the locations of text substrings similar to the pattern are required but the distances
themselves are not needed.



4.1 The unbounded case 

In [MNU05], Mäkinen, Navarro and Ukkonen gave an $O(nm \log m)$ time algorithm for the
shift-normalised Hamming distance problem, Shift-Ham, which by definition also solves the
bounded, k-mismatch variant in $O(nm \log m)$ time. We briefly recap their method
by way of an introduction. First observe that the maximum number of matches for any
alignment is exactly

$$m - d^+(i) = \max_{\alpha} \left|\{\, j \mid T[i+j] - P[j] = \alpha \,\}\right| .$$

For each alignment $i$, this value can be obtained by creating an $m$-length array $A_i$, which
we refer to as the shift array, defined by

$$A_i[j] = T[i+j] - P[j] \qquad (4)$$

for all $j \in [m]$. This shift array is then sorted to find the most frequent value, which is
the $\alpha$ that attains the minimum in $d^+(i)$. The number of times it occurs is $m - d^+(i)$. Computing
this requires $O(m \log m)$ time per alignment and hence $O(nm \log m)$ time overall. In the
next section we will reconsider $A_i$ and demonstrate that it can be run-length encoded in
$O(k)$ runs whenever $d^+(i) \leq k$.
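The sort-and-scan method recapped above can be written down directly. The sketch below is a minimal illustration with 0-based indexing and a function name of our own choosing; it follows the description in this section rather than reproducing the implementation of [MNU05].

```python
from itertools import groupby

def shift_normalised_distances(P, T):
    """Return d+(i) for every alignment i by sorting each shift array A_i."""
    m, n = len(P), len(T)
    distances = []
    for i in range(n - m + 1):
        # Shift array of Equation (4), sorted so equal values are adjacent.
        A = sorted(T[i + j] - P[j] for j in range(m))
        # Length of the longest block of equal values = best number of matches.
        most_frequent = max(len(list(block)) for _, block in groupby(A))
        distances.append(m - most_frequent)
    return distances
```

With `P = [1, 2, 3]` and `T = [5, 6, 7, 0, 1]`, the first alignment is an exact match under the shift $4$, and the other two alignments each have one mismatch.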

4.2 A deterministic solution 

The deterministic algorithm makes use of the notion of difference strings, which were
introduced in [MNU05] and are defined as follows.

Definition 21. Let $S$ be a string of length $s$. The difference string of $S$, denoted $S_\delta$, is
defined by

$$S_\delta[j] = S[j+1] - S[j]$$

for all $j \in [s-1]$. The length of $S_\delta$ is $s - 1$.

We will also make use of a generalisation of the difference string when we present
our randomised algorithm in Section 4.3. The core of our deterministic shift-normalised
k-mismatch algorithm is the relationship between the number of mismatches between $P_\delta$
and $T_\delta[i \ldots i+m-2]$ and the value of $d^+(i)$. We begin in Lemma 22 below by showing
that if $d^+(i)$ is small then the number of mismatches between the difference strings $P_\delta$
and $T_\delta$ is also small. In [MNU05] a related result was used to reduce the shift-normalised
exact matching problem to the conventional exact matching problem. Specifically, they
observed that in the special case that $k = 0$, the implication becomes an equivalence,
i.e., $d^+(i) = 0$ if and only if $P_\delta = T_\delta[i \ldots i+m-2]$. Unfortunately, this is not the case
in general.
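A small example may help to see why the equivalence holds for $k = 0$ but fails for larger $k$. The sketch below uses helper names of our own choosing and compares the shift-normalised distance with the Hamming distance of the difference strings for two hand-picked inputs.

```python
def difference_string(S):
    """Difference string of Definition 21: consecutive differences of S."""
    return [S[j + 1] - S[j] for j in range(len(S) - 1)]

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def shift_distance(P, W):
    """Shift-normalised Hamming distance of P and an equal-length window W."""
    diffs = [w - p for p, w in zip(P, W)]
    return len(P) - max(diffs.count(d) for d in set(diffs))

# Exact match under the shift 10: both quantities are zero.
P = [3, 1, 4, 1]
W = [13, 11, 14, 11]
print(shift_distance(P, W), hamming(difference_string(P), difference_string(W)))

# One mismatch between the difference strings, yet two mismatches under any shift.
P0 = [0, 0, 0, 0]
W0 = [0, 0, 5, 5]
print(shift_distance(P0, W0), hamming(difference_string(P0), difference_string(W0)))
```

In the second pair the difference strings differ in a single position, which is at most $2k$ for $k = 1$, even though $d^+ = 2 > k$. A small number of difference-string mismatches is therefore necessary but not sufficient for a small shift-normalised distance.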

Lemma 22. Let $P$ be a pattern and $T$ a text. For all $i$,

$$d^+(i) \leq k \implies \mathrm{Ham}(P_\delta, T_\delta[i \ldots (i+m-2)]) \leq 2k .$$

Proof. Let $i$ be such that $d^+(i) \leq k$, so that there exists an $\alpha$ such that for at most
$k$ distinct positions $j \in [m]$ we have $P[j] + \alpha \neq T[i+j]$. Further, at most $2k$ distinct
positions $j \in [m-1]$ have either $P[j] + \alpha \neq T[i+j]$ or $P[j+1] + \alpha \neq T[i+j+1]$.
This implies that there are at least $(m-1) - 2k$ distinct positions $j \in [m-1]$ such that
$P[j] + \alpha = T[i+j]$ and $P[j+1] + \alpha = T[i+j+1]$. By rearranging these equations, for any
such $j$ we have that $P[j+1] - P[j] = T[i+j+1] - T[i+j]$ and hence by Definition 21,
$P_\delta[j] = T_\delta[i+j]$. As required there are at most $2k$ mismatches (recall $|P_\delta| = m - 1$). □

Lemma 22 suggests the following strategy. First we find the leftmost up to $2k + 1$
mismatches between $P_\delta$ and $T_\delta[i \ldots i+m-2]$ at each alignment $i$. By Lemma 22 we can
disregard any alignments with more than $2k$ mismatches. Finally we use the locations of
these mismatches to infer $d^+(i)$ at the remaining alignments.

The first step can be done using any k-mismatch (strictly 2k-mismatch) algorithm
which returns the locations of the mismatches. The well-known 'kangaroo' method
of [LV86] achieves this in optimal $O(nk)$ time. The method is so named as it uses
longest common extensions to 'hop' between mismatches in constant time. The discarding
phase is trivial and therefore we only focus on computing $d^+(i)$ from the locations of
the (at most $2k$) mismatches between $P_\delta$ and $T_\delta[i \ldots (i+m-2)]$, where $i$ is an arbitrary
non-discarded alignment.

Recall from Section 4.1 the definition of the shift array $A_i$ in Equation (4), and recall
that the value of $m - d^+(i)$ is the number of occurrences of the most frequent entry in $A_i$.
We will now use the locations of the mismatches between $P_\delta$ and $T_\delta[i \ldots i+m-2]$ to obtain
a run-length encoded version of $A_i$ containing $O(k)$ runs. The key property we require
is given in Lemma 23, which states that a matching substring in $P_\delta$ and $T_\delta[i \ldots i+m-2]$
corresponds to a run (a substring of equal values) in $A_i$. This immediately implies that
$A_i$ can be decomposed into at most $4k + 1$ runs. Specifically, one run of length 1 for each
mismatch and an additional run for each stretch between mismatches.

Lemma 23. If $P_\delta[\ell \ldots r] = T_\delta[(i+\ell) \ldots (i+r)]$ then $A_i[j] = A_i[\ell]$ for all $j \in \{\ell, \ldots, r\}$.

Proof. Suppose that $P_\delta[\ell \ldots r] = T_\delta[(i+\ell) \ldots (i+r)]$. We proceed by induction on
$j \in \{\ell, \ldots, r\}$. The base case $j = \ell$ is tautologically true. For the inductive step, let
$j \in \{\ell+1, \ldots, r\}$. By the inductive hypothesis, we have that $A_i[j-1] = T[i+j-1] - P[j-1] = A_i[\ell]$.
As $P_\delta[j-1] = T_\delta[i+j-1]$, by Definition 21 (and rearranging the
equation), we have $A_i[j] = T[i+j] - P[j] = T[i+j-1] - P[j-1] = A_i[\ell]$. □

Algorithm 3 Overview of deterministic solution to Shift-k-Mismatch.

1. Compute the difference strings $P_\delta$ and $T_\delta$ by scanning $P$ and $T$.

2. Run a 2k-mismatch algorithm on $P_\delta$ and $T_\delta$ in order to find all alignments where
the number of mismatches is at most $2k$. The 2k-mismatch algorithm must also
return the locations of the mismatches at any alignment where there are at most
$2k$ mismatches.

3. Discard all alignments with more than $2k$ mismatches.

4. For each undiscarded alignment $i$, decompose $A_i$ into at most $4k + 1$ runs (substrings
with a common value). The start and end points of the runs are determined by
scanning the locations of the mismatches between $P_\delta$ and $T_\delta[i \ldots i+m-2]$.

5. Sort the runs in $A_i$ by value in order to find the most frequent entry $\alpha$ in $A_i$. Then
output $m - |\{\, j \mid A_i[j] = \alpha \,\}|$, which is the value $d^+(i)$.



In Section 4.1 we discussed that the value of $d^+(i)$ equals $m - \max_{\alpha} |\{\, j \mid A_i[j] = \alpha \,\}|$,
which could be found by sorting and scanning $A_i$ in $O(m \log m)$ time. However, we now
have $A_i$ in run-length encoded form (with $O(k)$ runs), therefore the time taken to find
$d^+(i)$ is reduced to $O(k \log k)$. Over all alignments, this gives $O(nk \log k)$ time as desired.

We can now give an overview of our deterministic algorithm for Shift-k-Mismatch.
The steps are described in Algorithm 3 and the overall running time is given in
Theorem 24 below.

Theorem 24. The shift-normalised k-mismatch problem (Shift-k-Mismatch) can be
solved deterministically in $O(nk \log k)$ time.

Proof. The solution is outlined in Algorithm 3. Correctness follows directly from the
discussion in this section. The time complexity of the five steps is as follows. By inspection
of the definition, the difference strings computed in Step 1 require $O(n)$ time. Step 2
uses a 2k-mismatch algorithm as a black box and can be performed in $O(nk)$ time by
using for example the algorithm in [LV86]. Step 3 makes a single pass of the output of
the 2k-mismatch algorithm in $O(n)$ time. Step 4 constructs a run-length encoded version
of $A_i$ for each undiscarded $i$. This requires scanning the $O(k)$ mismatches at each
undiscarded alignment. Therefore Step 4 takes $O(nk)$ time. Step 5 scans and sorts each
$A_i$, which takes $O(k \log k)$ time per alignment as $A_i$ is encoded by $O(k)$ runs. Overall the
algorithm requires $O(nk \log k)$ time as claimed. □
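The five steps of Algorithm 3 can be sketched as follows. This is a simplified illustration and not the implementation analysed in Theorem 24: Step 2 is replaced by a naive scan that stops after $2k + 1$ mismatches (so the sketch does not achieve the $O(nk \log k)$ bound, which relies on the kangaroo method of [LV86]), the shift array is split only at mismatch boundaries (giving at most $2k + 1$ runs, within the $4k + 1$ bound), and run weights are tallied in a dictionary rather than sorted. All names are our own.

```python
from collections import Counter

def difference_string(S):
    """Difference string of Definition 21."""
    return [S[j + 1] - S[j] for j in range(len(S) - 1)]

def shift_k_mismatch(P, T, k):
    """For each alignment i, return d+(i) if it is at most k, else k + 1."""
    m, n = len(P), len(T)
    Pd, Td = difference_string(P), difference_string(T)       # Step 1
    output = []
    for i in range(n - m + 1):
        # Step 2 (naive stand-in): leftmost mismatches, at most 2k + 1 of them.
        mismatches = []
        for j in range(m - 1):
            if Pd[j] != Td[i + j]:
                mismatches.append(j)
                if len(mismatches) > 2 * k:
                    break
        if len(mismatches) > 2 * k:                            # Step 3
            output.append(k + 1)
            continue
        # Step 4: a mismatch at j means A_i may change between j and j + 1,
        # so each run starts at 0 or immediately after a mismatch position.
        starts = [0] + [j + 1 for j in mismatches]
        ends = starts[1:] + [m]
        # Step 5: total length of the runs sharing each value of A_i.
        weight = Counter()
        for a, b in zip(starts, ends):
            weight[T[i + a] - P[a]] += b - a
        d = m - max(weight.values())
        output.append(d if d <= k else k + 1)
    return output
```

For example, `shift_k_mismatch([1, 2, 3], [5, 6, 7, 0, 1], 1)` reports the distances $0, 1, 1$, whereas an alignment whose difference strings disagree in more than $2k$ positions is reported as $k + 1$ without its shift array ever being examined.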

4.3 An improved, randomised solution 

We now present an improved solution to the shift-normalised k-mismatch problem which
runs in $O(cn\sqrt{k \log k}\,\log n)$ time. The improved algorithm is for the case that $k < \sqrt{m/6}$
and is randomised. The errors are one-sided (false positives) and it outputs the correct
answer at all alignments with probability at least $1 - 1/n^c$ for any constant $c$. For each
position $i$, the algorithm gives a yes/no answer to the question "is $d^+(i) \leq k$?". The
algorithm does not output the actual distance $d^+(i)$. Throughout this section, we use $T_i$
as shorthand for $T[i \ldots i+m-1]$.

In Section 4.2 our deterministic algorithm made use of the locations of mismatches
in the difference strings $P_\delta$ and $T_\delta[i \ldots i+m-2]$. Recall that the difference string $S_\delta$
was defined to give the differences between consecutive positions in a string $S$. That is,
$S_\delta[j] = S[j+1] - S[j]$ for all $j$. A key observation was that $P_\delta[j] = T_\delta[i+j]$ if and
only if $P[j] - T[i+j] = P[j+1] - T[i+j+1] = -\alpha$, i.e., the positions of $P[j]$ and
$P[j+1]$ require the same shift $\alpha$ to match. However, there is no reason to consider only
consecutive differences. In fact, as we will see, one may consider differences under any
arbitrary permutation of the position set. This notion is formalised as follows.

Definition 25. Let $S$ be a string of length $s$ and $\pi : [s] \to [s]$ be a permutation. The
permuted difference string of $S$ under $\pi$, denoted $S_\pi$, is defined by

$$S_\pi[j] = S[\pi(j)] - S[j]$$

for all $j \in [s]$. The length of $S_\pi$ is $s$.

Note that the permuted difference string $S_\pi$ has length $|S|$, in contrast to the difference
string $S_\delta$ of Definition 21, which has length $|S| - 1$.

The central idea of our improved algorithm is to use the value of $\mathrm{Ham}(P_\pi, (T_i)_\pi)$
to directly determine whether $d^+(i) \leq k$ at each alignment $i$. In Definition 26 below
we introduce the notion of a permutation being k-tight for some $P, T_i$. Intuitively, $\pi$ is
k-tight for $P, T_i$ if we can infer directly from $\mathrm{Ham}(P_\pi, (T_i)_\pi)$ whether $d^+(i) \leq k$.

Definition 26. Let $\pi$ be a permutation, $P$ a pattern and $T_i$ a text substring. We say
that $\pi$ is k-tight for $P, T_i$ if

$$d^+(i) \leq k \iff \mathrm{Ham}(P_\pi, (T_i)_\pi) \leq 2k .$$

It would of course be highly desirable to find a permutation $\pi$ which is k-tight for all
$P, T_i$ and any $k$. However, we will see that this is in general not possible.

We begin by showing that any $\pi$ has the property that $d^+(i) \leq k$ implies that
$\mathrm{Ham}(P_\pi, (T_i)_\pi) \leq 2k$ for all $P, T_i$. To do so we first prove a general lemma which will
also be useful later. Lemma 28 then gives the desired property and is a generalisation of
Lemma 22 to arbitrary permutations.

Lemma 27. Let $\pi$ be a permutation, $P$ a pattern and $T_i$ a text substring. For all $j \in [m]$,

$$P[j] - T_i[j] = P[\pi(j)] - T_i[\pi(j)] \iff P_\pi[j] = (T_i)_\pi[j] .$$

Proof. The left-hand side of the arrow is the same as $P[\pi(j)] - P[j] = T_i[\pi(j)] - T_i[j]$,
which by Definition 25 is equivalent to the right-hand side of the arrow. □

Lemma 28. Let $\pi$ be a permutation, $P$ a pattern and $T_i$ a text substring. Then

$$d^+(i) \leq k \implies \mathrm{Ham}(P_\pi, (T_i)_\pi) \leq 2k .$$

Proof. Let $P$ and $T_i$ be such that $d^+(i) \leq k$. By definition there exists an $\alpha$ such that
the set $J = \{\, j \mid P[j] + \alpha \neq T_i[j] \,\}$ has size at most $k$. As $\pi$ is a permutation, there
are at most $2k$ positions $j \in [m]$ such that either $j \in J$ or $\pi(j) \in J$. Therefore, for
all (at least) $m - 2k$ remaining positions $j' \in [m]$ we have that $P[j'] + \alpha = T_i[j']$ and
$P[\pi(j')] + \alpha = T_i[\pi(j')]$. For each such position $j'$, by rearranging the two equations it
follows from Lemma 27 that $P_\pi[j'] = (T_i)_\pi[j']$. Thus, there are at most $2k$ mismatches
between $P_\pi$ and $(T_i)_\pi$. □
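The bound of Lemma 28 is easy to observe on a small instance. The sketch below, with names of our own choosing, builds a window that equals the pattern shifted by $3$ except at one position (so $d^+ = 1$) and counts the permuted-difference mismatches under every cyclic rotation of the positions.

```python
def permuted_difference(S, pi):
    """Permuted difference string of Definition 25; pi is a list with pi[j] = image of j."""
    return [S[pi[j]] - S[j] for j in range(len(S))]

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def rotation(m, q):
    """The permutation j -> (j + q) mod m on m positions."""
    return [(j + q) % m for j in range(m)]

P = [2, 7, 1, 8, 2, 8]
W = [5, 10, 4, 11, 5, 0]     # P shifted by 3, except for the last position
k = 1

counts = [hamming(permuted_difference(P, rotation(len(P), q)),
                  permuted_difference(W, rotation(len(P), q)))
          for q in range(1, len(P))]
print(counts)
```

Each rotation yields exactly two mismatches here: one at the corrupted position itself and one at the position mapped onto it, in line with the "at most $2k$" count in the proof.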

A logical next step would be to attempt to find a permutation $\pi$ with the property
that $\mathrm{Ham}(P_\pi, (T_i)_\pi) \leq 2k$ implies that $d^+(i) \leq k$ for all $P, T_i$. Unfortunately, Lemma 29
below shows that no such permutation can exist. As Corollary 30 states, this immediately
implies that there is no permutation which is k-tight for all $P, T_i$. Instead we will select
our permutation at random and show that we can obtain a permutation that is k-tight
for a given $P, T_i$ with constant probability.

Lemma 29. Let $\pi$ be any permutation and $6 \leq k < m/4$. There exists a pattern $P$ and
text substring $T_i$ such that

$$d^+(i) > k \quad\text{and}\quad \mathrm{Ham}(P_\pi, (T_i)_\pi) \leq 2k .$$

Proof. We define $P$ to be an $m$-length string of zeros. In order to define the $m$-length
string $T_i$ we first introduce some notation.

Let $k' = \lfloor k/2 \rfloor + 1$. We identify a set of $k'$ locations $\ell_0, \ell_1, \ldots, \ell_{k'-1} \in [m]$ as follows.
Location $\ell_0 = 0$. For $q \in \{1, \ldots, k'-1\}$, location $\ell_q$ is the smallest position in $[m]$ that
is not any of the preceding locations $\ell_0, \ldots, \ell_{q-1}$ or any location that is mapped to or
from by any of these preceding locations (under $\pi$). Formally, $\ell_q$ is the smallest location
which is not in the set $L_q = \{\, \ell_{q'}, \pi(\ell_{q'}), \pi^{-1}(\ell_{q'}) \mid q' \in [q] \,\}$. Observe that the set $L_q$
has size at most $3k' \leq 3(k/2 + 1) < 3k < m$ (since $k < m/4$), hence such a location
always exists.

We can now define $T_i$ as follows. For all $q \in [k']$, let $T_i[\ell_q] = 1$ and $T_i[\pi(\ell_q)] = 1$. At all
other locations $j$, $T_i[j] = 0$. Observe that by construction, the locations $\ell_0, \ldots, \ell_{k'-1}$,
$\pi(\ell_0), \ldots, \pi(\ell_{k'-1})$ are all distinct. Therefore, $T_i$ contains exactly $2k'$ ones and $m - 2k'$
zeros. As $2k' \leq k + 2 < m/2$, more than half the locations have $T_i[j] = P[j] = 0$, and
therefore $d^+(i)$ is minimised by the shift $\alpha = 0$. Thus, $d^+(i) = 2k' > k$.

We proceed by showing that the alignment of $P_\pi$ and $(T_i)_\pi$ contains at least $m - 3k'$
matches. There are $m - 2k'$ locations $j$ in $T_i$ such that $T_i[j] = 0$. Of these locations, at
most $2k'$ have $T_i[\pi(j)] = 1$. Therefore, there are at least $m - 4k'$ locations $j$ such that
$T_i[j] = T_i[\pi(j)] = 0$. As $P[j] = P[\pi(j)] = 0$, we have by Lemma 27 that $P_\pi[j] = (T_i)_\pi[j]$
at $m - 4k'$ locations. Now consider locations $\ell_q$ for $q \in [k']$. By construction,
$T_i[\ell_q] = T_i[\pi(\ell_q)] = 1$ and therefore $P_\pi[\ell_q] = (T_i)_\pi[\ell_q]$ by Lemma 27. This implies a further $k'$
matching locations. There are therefore at least $m - 3k'$ matches, or at most $3k'$
mismatches, between $P_\pi$ and $(T_i)_\pi$. Since $3k' \leq 3(k/2 + 1) \leq 2k$ for all $k \geq 6$ we have
that $\mathrm{Ham}(P_\pi, (T_i)_\pi) \leq 2k$. □
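The construction in the proof above is effectively a greedy procedure and can be sketched as such. The code below is an illustration with names of our own choosing; it assumes that $\pi$ is given as a list and has no fixed points (as is the case for the cyclic permutations used later), so that the locations $\ell_q$ and $\pi(\ell_q)$ are indeed distinct.

```python
def lemma29_text(pi, k):
    """Build the 0/1 string T_i from the proof of Lemma 29 (P is all zeros).

    pi is a permutation of range(m) given as a list, assumed to have no fixed
    points; the lemma's range 6 <= k < m/4 is assumed but not checked.
    """
    m = len(pi)
    inverse = [0] * m
    for j, image in enumerate(pi):
        inverse[image] = j
    k_prime = k // 2 + 1
    blocked, chosen = set(), []
    for _ in range(k_prime):
        # Smallest location not used by, mapped to, or mapped from earlier picks.
        loc = next(j for j in range(m) if j not in blocked)
        chosen.append(loc)
        blocked.update((loc, pi[loc], inverse[loc]))
    W = [0] * m
    for loc in chosen:
        W[loc] = 1
        W[pi[loc]] = 1
    return W

def permuted_mismatches_against_zeros(W, pi):
    """Ham(P_pi, W_pi) when P is the all-zero pattern."""
    return sum(W[pi[j]] - W[j] != 0 for j in range(len(W)))

m, k = 40, 6
pi = [(j + 7) % m for j in range(m)]
W = lemma29_text(pi, k)
print(sum(W), permuted_mismatches_against_zeros(W, pi))
```

Here $k' = 4$, so the window contains eight ones and $d^+ = 8 > k$, while the number of permuted-difference mismatches stays within $2k = 12$.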

Corollary 30. Let $\pi$ be any permutation and $6 \leq k < m/4$. There exists a pattern $P$
and text substring $T_i$ for which $\pi$ is not k-tight.

Proof. Immediate from Definition 26 and Lemma 29. □



4.3.1 Random permutations 

We will choose a permutation uniformly at random from a simple family of permutations.
On first inspection, we could have chosen from the family of all permutations. We claim
without proof that a permutation chosen uniformly at random from the family of all
permutations is k-tight for any $P, T_i$ with constant probability. However, we must be
able to efficiently compute $\mathrm{Ham}(P_\pi, (T_i)_\pi)$ for all $i$ under our chosen permutation. The
key problem is that in general $(T_i)_\pi$ is not easily obtained from $T$. As $i$ varies, $(T_i)_\pi$
could change drastically, even when $i$ is only incremented by one. Therefore we must be
careful in selecting our family of permutations.

We will use the family of cyclic permutations, denoted $C_m$ (for patterns of length $m$),
defined as follows.

Definition 31. The set $C_m$ contains the $m - 1$ cyclic permutations $\pi_1, \pi_2, \ldots, \pi_{m-1}$,
where

$$\pi_q(j) = (j + q) \bmod m .$$

We now show in Lemma 32 that $C_m$ has the desired property of k-tightness when
$m > 6k^2$. There is a corner case when $k \in \{0, 1\}$ which is easily solved in $O(n)$ time
using our deterministic algorithm from Section 4.2. For Lemma 32 we require that $k \geq 2$.

Lemma 32. Let $P$ be a pattern and $T_i$ a text substring. When $m > 6k^2$ and $k \geq 2$,

$$\frac{|\{\, \pi \mid \pi \in C_m \text{ is k-tight for } P, T_i \,\}|}{|C_m|} \geq \frac{1}{6} .$$

Proof. Let $p = |\{\, \pi \mid \pi \in C_m \text{ is k-tight for } P, T_i \,\}| / |C_m|$. We will show that $p \geq 1/6$.
Note that $|C_m| = m - 1$. We let $h = d^+(i)$ be the minimal number of mismatches between
$P$ and $T_i$, and $\hat{\alpha}$ be the shift which minimises $d^+(i)$.

Assume first that $h \leq k$. By Lemma 28 and Definition 26 we have that every
$\pi \in C_m$ is k-tight for $P, T_i$ and therefore $p = 1$. Assume second that $h > k$. We
split the proof into three cases:

Case 1. $k < h \leq 2k$
Case 2. $2k < h \leq m/3$
Case 3. $m/3 < h$

First we introduce some notation. There are exactly $m - h$ positions $j$ where $\hat{\alpha} + P[j] = T_i[j]$.
We call such a position an $\hat{\alpha}$-match. Similarly, any position with $\alpha + P[j] = T_i[j]$ for
some $\alpha$ is called an $\alpha$-match. Positions which are not $\alpha$-matches are called $\alpha$-mismatches.
Hence there are $h$ distinct $\hat{\alpha}$-mismatches. We will refer to $\pi(j)$ as the position that $j$ is
mapped to (by $\pi$).

Case 1 ($k < h \leq 2k$). Let $j$ be an arbitrary $\hat{\alpha}$-mismatch. Position $j$ is mapped to
another $\hat{\alpha}$-mismatch in exactly $h - 1$ distinct permutations of $C_m$. This holds for each
of the $h$ distinct $\hat{\alpha}$-mismatches. Hence there are at most $(h - 1)h$ permutations under
which some $\hat{\alpha}$-mismatch is mapped to another $\hat{\alpha}$-mismatch. The remaining (at least)
$(m - 1) - (h - 1)h$ permutations $\pi$ in $C_m$ immediately have the following two properties:

(i) if position $j$ is an $\hat{\alpha}$-mismatch then $\pi(j)$ is an $\hat{\alpha}$-match;

(ii) if position $\pi(j)$ is an $\hat{\alpha}$-mismatch then position $j$ is an $\hat{\alpha}$-match.

There are $h$ positions $j$ with property (i) and another (disjoint) $h$ positions $j$ with
property (ii). That is, for each $\hat{\alpha}$-mismatch there are two positions $j$ that meet one of the two
properties above. By Lemma 27 each such $j$ implies that $P_\pi[j] \neq (T_i)_\pi[j]$. Therefore,
in each of these $(m - 1) - (h - 1)h$ permutations $\pi$, $\mathrm{Ham}(P_\pi, (T_i)_\pi) \geq 2h > 2k$ and so
each such permutation is k-tight for $P, T_i$. By the assumption of Case 1, $h \leq 2k$, and the
assumptions that $m \geq 6k^2$ and $k \geq 2$, we have that $(m - 1) - (h - 1)h \geq m - 4k^2 \geq m/3$.
Thus, $p \geq (m/3)/(m - 1) > 1/6$.

Case 2 ($2k < h \leq m/3$). Let $K$ be an arbitrary set of $2k$ distinct $\hat{\alpha}$-mismatches. For
any permutation $\pi$, let

$$K^{-\pi} = \{\, j \mid \pi(j) \in K \,\} .$$

We define

$$\mathrm{Ham}_K(P_\pi, (T_i)_\pi) = \left|\{\, j \mid j \in (K \cup K^{-\pi}) \wedge P_\pi[j] \neq (T_i)_\pi[j] \,\}\right|$$

to be the number of mismatch positions between $P_\pi$ and $(T_i)_\pi$ that are also in $K$ or $K^{-\pi}$.
We now consider the total number of mismatches between $P_\pi$ and $(T_i)_\pi$ (that are in $K$
or $K^{-\pi}$) summed over all permutations in $C_m$. Let

$$H_K(P, T_i) = \sum_{\pi \in C_m} \mathrm{Ham}_K(P_\pi, (T_i)_\pi) .$$

Since $h \leq m/3$ by the assumption of Case 2, there are at least $2m/3$ $\hat{\alpha}$-matches. A
permutation $\pi$ that maps a position $j \in K$ to an $\hat{\alpha}$-match creates a mismatch
$P_\pi[j] \neq (T_i)_\pi[j]$ by Lemma 27 (as $j$ is an $\hat{\alpha}$-mismatch). For a fixed $j \in K$, the number of
permutations in $C_m$ that map $j$ to an $\hat{\alpha}$-match equals the number of $\hat{\alpha}$-matches, which
is at least $2m/3$. Thus, the set $K$ of $2k$ $\hat{\alpha}$-mismatches contributes at least $2k \cdot (2m/3)$ to
$H_K(P, T_i)$.

Similarly, any position $j$ which is an $\hat{\alpha}$-match creates a mismatch $P_\pi[j] \neq (T_i)_\pi[j]$
by Lemma 27 if it is mapped to an $\hat{\alpha}$-mismatch in $K$. This occurs under exactly $2k$
permutations. Recall that any $j$ which is mapped to a position in $K$ under $\pi$ belongs
to $K^{-\pi}$. Therefore, given that there are at least $2m/3$ $\hat{\alpha}$-matches, the contribution is at
least $2k \cdot (2m/3)$ further distinct mismatches to $H_K(P, T_i)$.

Summing up the previous two paragraphs, we have shown that $H_K(P, T_i) \geq (8/3)mk$.
Each permutation $\pi$ that is not k-tight for $P, T_i$ has $\mathrm{Ham}(P_\pi, (T_i)_\pi) \leq 2k$ (since $h > k$).
Therefore, $m \cdot 2k$ is a generous upper bound on the number of mismatches across all
permutations which are not k-tight. This leaves at least $(8/3)mk - 2mk = (2/3)mk$
mismatches among the k-tight permutations of $C_m$. Since $|K \cup K^{-\pi}| \leq 4k$, we have
that $\mathrm{Ham}_K(P_\pi, (T_i)_\pi) \leq 4k$ for any $\pi$, hence each permutation contributes at most $4k$
mismatches to $H_K(P, T_i)$. Therefore there are at least $(2/3)mk / (4k) = m/6$ distinct
k-tight permutations. Thus, $p \geq (m/6)/(m - 1) \geq 1/6$.

Case 3 ($m/3 < h$). Similarly to Case 2, we consider the total number of mismatches
between $P_\pi$ and $(T_i)_\pi$ summed over all permutations in $C_m$. Let

$$H(P, T_i) = \sum_{\pi \in C_m} \mathrm{Ham}(P_\pi, (T_i)_\pi) .$$

Since $h > m/3$ the number of $\alpha$-mismatches is more than $m/3$ for all $\alpha$. Fix an arbitrary
position $j$ and choose an $\alpha$ such that $j$ is an $\alpha$-match. There are at least $m/3$ permutations
$\pi$ in $C_m$ that map position $j$ to an $\alpha$-mismatch. By Lemma 27, $P_\pi[j] \neq (T_i)_\pi[j]$
for each of these permutations. Hence position $j$ will contribute at least $m/3$ to
$H(P, T_i)$. By considering all $m$ positions $j$, we have that $H(P, T_i) \geq m \cdot (m/3)$.

Similarly to the reasoning in Case 2, each permutation $\pi$ that is not k-tight for $P, T_i$
has $\mathrm{Ham}(P_\pi, (T_i)_\pi) \leq 2k$ (since $h > k$). Again, $m \cdot 2k$ is a generous upper bound on
the number of mismatches across all permutations which are not k-tight. This leaves
at least $m^2/3 - 2mk$ mismatches among the k-tight permutations of $C_m$. As certainly
$\mathrm{Ham}(P_\pi, (T_i)_\pi) \leq m$, we have that there are at least $m/3 - 2k$ distinct k-tight
permutations for $P, T_i$. Therefore,

$$p \geq \frac{m/3 - 2k}{m - 1} > \frac{k - 1}{3k} \geq \frac{1}{6} ,$$

where the second inequality follows from $m > 6k^2$ and the last inequality from $k \geq 2$,
both assumptions in the statement of the lemma. □
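For a concrete feel for Lemma 32, the fraction of k-tight cyclic permutations can be counted exhaustively on a small instance. The sketch below uses names of our own choosing and an instance falling under Case 1 ($m = 30$, $k = 2$, $h = 3$); it is a brute-force check and plays no part in the algorithm itself.

```python
def shift_distance(P, W):
    """Shift-normalised Hamming distance of P and an equal-length window W."""
    diffs = [w - p for p, w in zip(P, W)]
    return len(P) - max(diffs.count(d) for d in set(diffs))

def permuted_hamming(P, W, q):
    """Ham(P_pi, W_pi) for the cyclic permutation pi_q(j) = (j + q) mod m."""
    m = len(P)
    return sum(P[(j + q) % m] - P[j] != W[(j + q) % m] - W[j] for j in range(m))

def count_tight(P, W, k):
    """Number of pi_q in C_m that are k-tight for P, W (Definition 26)."""
    m = len(P)
    close = shift_distance(P, W) <= k
    return sum((permuted_hamming(P, W, q) <= 2 * k) == close
               for q in range(1, m))

m, k = 30, 2
P = [0] * m
W = [1, 1, 1] + [0] * (m - 3)      # three mismatches under the best shift 0
tight = count_tight(P, W, k)
print(tight, m - 1)
```

Only the four rotations by $\pm 1$ and $\pm 2$ map one mismatch onto another, so the remaining $25$ of the $29$ permutations are k-tight, comfortably above the $1/6$ guaranteed by the lemma.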

4.3.2 The algorithm 

Before describing the randomised algorithm we turn our attention to the problem of
finding all positions $i \in [n - m + 1]$ such that $\mathrm{Ham}(P_\pi, (T_i)_\pi) \leq 2k$ under an arbitrary
cyclic permutation $\pi \in C_m$. We will describe a simple deterministic algorithm that
computes $\mathrm{Ham}(P_\pi, (T_i)_\pi)$ by reduction to the conventional k-mismatch problem.

Let $\pi_q \in C_m$ be a fixed but arbitrary permutation ($q \in \{1, \ldots, m-1\}$). Recall that
$\pi_q(j) = (j + q) \bmod m$. We define

$$P_q^+ = P_{\pi_q}[0 \ldots (m-q-1)] , \qquad P_q^- = P_{\pi_q}[(m-q) \ldots (m-1)] .$$

Thus, $P_{\pi_q} = P_q^+ \,\|\, P_q^-$. We have $|P_q^+| = m - q$ and $|P_q^-| = q$. Now define $T_q^+$ and $T_q^-$ such
that

$$T_q^+[j] = T[j+q] - T[j] , \qquad T_q^-[j] = T[j+q-m] - T[j] ,$$

for all $j \in [n]$ (except those that take the indices "out of range"). Observe that

$$(T_i)_{\pi_q} = T_q^+[i \ldots (i+m-q-1)] \,\|\, T_q^-[(i+m-q) \ldots (i+m-1)] ,$$

where the first substring has length $m - q$ and the second substring has length $q$. From
these definitions it now follows directly that

$$\mathrm{Ham}(P_{\pi_q}, (T_i)_{\pi_q}) = \mathrm{Ham}(P_q^+, T_q^+[i \ldots (i+m-q-1)]) + \mathrm{Ham}(P_q^-, T_q^-[(i+m-q) \ldots (i+m-1)]) . \qquad (5)$$

Thus, in order to determine which positions $i$ have $\mathrm{Ham}(P_{\pi_q}, (T_i)_{\pi_q}) \leq 2k$, we first
construct $P_q^+$, $P_q^-$, $T_q^+$ and $T_q^-$, and then we run a standard 2k-mismatch algorithm on
the pairs $(P_q^+, T_q^+)$ and $(P_q^-, T_q^-)$ and use Equation (5).

We can now finally give an overview of our randomised algorithm for the Shift-k-Decision
problem. The steps are described in Algorithm 4. The overall running time
and proof of correctness is given in Theorem 33 below. The algorithm makes one-sided
errors and outputs a false match (incorrectly reports $d^+(i) \leq k$) with constant probability
per alignment. As we will see in the proof of Theorem 33, running the algorithm a
logarithmic number of times drastically reduces the probability of an error occurring at
one or more alignments.

Algorithm 4 Overview of randomised solution to Shift-k-Decision.

1. Pick a cyclic permutation $\pi_q \in C_m$ uniformly at random.

2. Construct the strings $P_q^+$, $P_q^-$, $T_q^+$ and $T_q^-$.

3. Run a 2k-mismatch algorithm on the pairs $(P_q^+, T_q^+)$ and $(P_q^-, T_q^-)$ as a black box.

4. Using the results from Step 3 and Equation (5), compute $\mathrm{Ham}(P_{\pi_q}, (T_i)_{\pi_q})$ for all $i$.

5. Any alignment $i$ with $\mathrm{Ham}(P_{\pi_q}, (T_i)_{\pi_q}) \leq 2k$ is declared to have $d^+(i) \leq k$.
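The decomposition behind Equation (5) can be checked with a short sketch. The function names are our own, indexing is 0-based, and the two Hamming distances are computed naively rather than with a 2k-mismatch algorithm; the point is only that $T_q^+$ and $T_q^-$ are built once from $T$ and then shared by all alignments.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def permuted_difference(S, q):
    """S_pi for the cyclic permutation pi_q(j) = (j + q) mod len(S)."""
    m = len(S)
    return [S[(j + q) % m] - S[j] for j in range(m)]

def permuted_hamming_all(P, T, q):
    """Ham(P_pi_q, (T_i)_pi_q) for every alignment i via the split of Equation (5)."""
    m, n = len(P), len(T)
    Pq = permuted_difference(P, q)
    P_plus, P_minus = Pq[:m - q], Pq[m - q:]
    # T_plus[j] is defined while j + q stays inside T.
    T_plus = [T[j + q] - T[j] for j in range(n - q)]
    # T_minus[j] is defined once j + q - m is non-negative.
    T_minus = {j: T[j + q - m] - T[j] for j in range(m - q, n)}
    result = []
    for i in range(n - m + 1):
        left = hamming(P_plus, T_plus[i:i + m - q])
        right = sum(P_minus[j] != T_minus[i + m - q + j] for j in range(q))
        result.append(left + right)
    return result
```

Comparing the output with a direct evaluation of `permuted_difference` on each window `T[i:i + m]` confirms that the two agree for every $q \in \{1, \ldots, m-1\}$.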

Theorem 33. For any choice of constant $c$, Shift-k-Decision can be solved by a randomised
algorithm in $O(cn\sqrt{k \log k}\,\log n)$ (deterministic) time when $k < \sqrt{m/6}$. The algorithm makes only
false-positive errors (incorrectly declares the Hamming distance is at most $k$). With
probability at least $1 - 1/n^c$, the algorithm is correct at every alignment.

Proof. As discussed in Section 4.3.1, if $k \in \{0, 1\}$ then we can use the deterministic
algorithm from Section 4.2 and achieve time complexity of $O(n)$ and no errors. Therefore,
we focus on the case that $k \geq 2$.

We first consider correctness. It follows from the discussion above that Algorithm 4
does indeed determine, for every alignment $i$, whether $\mathrm{Ham}(P_{\pi_q}, (T_i)_{\pi_q}) \leq 2k$. We first
show that

(i) $\mathrm{Ham}(P_{\pi_q}, (T_i)_{\pi_q}) \leq 2k$ when $d^+(i) \leq k$;

(ii) the probability that $\mathrm{Ham}(P_{\pi_q}, (T_i)_{\pi_q}) \leq 2k$ when $d^+(i) > k$ is at most $5/6$.

By Lemma 28 we have that if $d^+(i) \leq k$ then $\mathrm{Ham}(P_{\pi_q}, (T_i)_{\pi_q}) \leq 2k$. This proves
property (i). By Definition 26, $\mathrm{Ham}(P_{\pi_q}, (T_i)_{\pi_q}) > 2k$ if $d^+(i) > k$ for all permutations
$\pi_q$ that are k-tight for $P, T_i$. The permutation $\pi_q$ is selected uniformly at random from
$C_m$ in Step 1, hence by Lemma 32 it is k-tight for $P, T_i$ with probability at least $1/6$.
This proves property (ii). Note that we can apply Lemma 32 since we have assumed that
$m > 6k^2$ and $k \geq 2$.

As Algorithm 4 only makes false-positive errors, we can amplify the probability of 
giving correct outputs by repeating the algorithm. We repeat it $4(c+1)\lceil \log n \rceil$ times, 
where c is a constant, and output any alignment which is reported by all repeats. More 
precisely, let i be some alignment such that $d^+(i) > k$. The probability that one run 
of Algorithm 4 incorrectly reports position i as a match is at most 5/6. Thus, the 
probability that all runs output i as a match is at most 

$$(5/6)^{4(c+1)\lceil \log n \rceil} \le (1/2)^{(c+1)\log n} \le 1/n^{c+1}.$$

By the union bound over all positions i, the probability of the multi-run algorithm 
outputting a false match in at least one alignment is at most $n \cdot 1/n^{c+1} = 1/n^c$, as 
required. 
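
The arithmetic behind the amplification step can be verified with exact rational numbers. The sketch below is illustrative only: it assumes the logarithm is taken to base 2, which is what makes $(1/2)^{(c+1)\log n}$ equal to $1/n^{c+1}$, and the function names are hypothetical.

```python
from fractions import Fraction


def repetitions(n, c):
    """Number of runs 4(c + 1) * ceil(log2 n), with a base-2 logarithm assumed."""
    ceil_log2 = (n - 1).bit_length()  # exact ceil(log2 n) for integers n >= 1
    return 4 * (c + 1) * ceil_log2


def per_alignment_failure(n, c):
    """Upper bound on the chance that every run reports one fixed false match."""
    return Fraction(5, 6) ** repetitions(n, c)


def union_bound(n, c):
    """Upper bound on the chance of a false match at any of at most n alignments."""
    return n * per_alignment_failure(n, c)
```

The key step is that $(5/6)^4 = 625/1296 < 1/2$, so each group of four runs at least halves the failure probability. For any n at least 2 and any positive integer c, `per_alignment_failure(n, c)` is at most $1/n^{c+1}$ and `union_bound(n, c)` is at most $1/n^c$.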

We now consider the time complexity of Algorithm 4 (without amplification). Step 1 
requires only constant time to pick a permutation at random. Step 2 requires $O(n)$ time 
by inspection of the definitions. Step 3 makes two calls to a $2k$-mismatch algorithm. For 
both calls the input is a pattern of length $O(m)$ and a text of length $O(n)$. Using the 
fastest known k-mismatch algorithm of Amir et al. [ALP04], this step takes $O(n\sqrt{k \log k})$ 
time. Steps 4 and 5 require only scanning the output of Step 3 and therefore take $O(n)$ 
time. This gives a running time of $O(n\sqrt{k \log k})$ for a single run. However, we repeat the 
algorithm $O(c \log n)$ times to reduce the error probability, hence $O(cn\sqrt{k \log k}\,\log n)$ is 
the total time complexity. □ 
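
The repeat-and-intersect structure of the procedure can be sketched as follows. This is an illustrative Python sketch under stated assumptions rather than the algorithm as specified in the paper: it takes $\pi_q$ to be a rotation by q drawn uniformly from 1, ..., m - 1, uses the difference encoding s[j] - s[(j + q) mod m], and replaces the $2k$-mismatch algorithm of [ALP04] with a naive O(nm) comparison, so it does not achieve the stated running time. All function names are hypothetical.

```python
import random
from collections import Counter


def shift_distance(P, T, i):
    """Brute-force d^+(i): m minus the most frequent value of T[i + j] - P[j]."""
    m = len(P)
    counts = Counter(T[i + j] - P[j] for j in range(m))
    return m - max(counts.values())


def one_run(P, T, k, rng):
    """One run: draw a rotation q, report alignments with at most 2k mismatches."""
    m, n = len(P), len(T)
    q = rng.randrange(1, m)  # assumption: rotations 1..m-1, chosen uniformly
    p_rot = [P[j] - P[(j + q) % m] for j in range(m)]
    reported = set()
    for i in range(n - m + 1):
        # Naive comparison standing in for the 2k-mismatch algorithm.
        mismatches = sum(
            p_rot[j] != T[i + j] - T[i + (j + q) % m] for j in range(m)
        )
        if mismatches <= 2 * k:
            reported.add(i)
    return reported


def shift_k_decision(P, T, k, runs, seed=0):
    """Keep only the alignments that are reported by every one of the runs."""
    rng = random.Random(seed)
    survivors = set(range(len(T) - len(P) + 1))
    for _ in range(runs):
        survivors &= one_run(P, T, k, rng)
    return survivors
```

Because a mismatch under the best shift can affect at most two entries of the difference encoding, every alignment with shift distance at most k survives every run, so the sketch never produces a false negative. Intersecting the runs can only remove false positives, which is why repetition helps.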



5 Discussion 

We have shown how to derive both new upper and lower bounds for a variety of pattern 
matching problems under polynomial transformations. In some cases we have improved 
on known results and in others introduced new problem definitions and solutions. There 
remain however a number of open questions. First, we suspect that the true complexity 
of POLY-r-L2 is unresolved, particularly for higher-degree polynomial transformations. For 
example, when r = m there exists a straightforward $O(nm)$ time solution obtained by considering 
the problem independently at each alignment. It is also still uncertain whether the normalised 
Hamming distance problem is 3SUM-hard for polynomials of degree greater than one. 
For Shift-k-Decision, our fast randomised algorithm applies only when $k < \sqrt{m/6}$. 
However, our lower bound for the same problem applies to the case where we want to 
determine if the Hamming distance is at most $m - 2$. This leaves a range of values of k 
where the complexity is not yet determined. It is also an interesting question whether 
our randomised solution can be efficiently modified to output the Hamming distance at 
each alignment, rather than simply a decision about whether it is at most k or greater than k, 
or indeed whether a new fast method can be found for this problem which allows the presence 
of wildcards in the input. 